
Empirical evaluation and study of text stemming algorithms

December 2020
Artificial Intelligence Review 53(4)


DOI:10.1007/s10462-020-09828-3

Authors:

Abdul Jabbar

Virtual University of Pakistan

Sajid Iqbal

Bahauddin Zakariya University

Shafiq Hussain

University of Sahiwal





AbdulJabbar1· SajidIqbal2 · ManzoorIlahiTamimy1· ShaqHussain3·

AdnanAkhunzada1

Abstract

Text stemming is one of the basic preprocessing steps for Natural Language Processing applications and is used to transform different word forms into a standard root form. For Arabic script based languages, adequate analysis of text by stemmers is a challenging task due to the large number of ambiguous structures in the language. In the literature, multiple performance evaluation metrics exist for stemmers, each describing the performance from a particular aspect. In this work, we review and analyze text stemming evaluation methods in order to devise criteria for better measurement of stemmer performance. The role of different aspects of stemmer performance measurement, such as main features, merits and shortcomings, is discussed using a resource-scarce language, i.e. Urdu. Through our experiments we conclude that the current evaluation metrics can only measure an average conflation of words regardless of the correctness of the stem. Moreover, some evaluation metrics favor only some types of languages. None of the existing evaluation metrics can perfectly measure stemmer performance for all kinds of languages. This study will help researchers to evaluate their stemmers using the right methods.

Keywords Natural language processing · Information retrieval · Text mining · Stemming algorithms · Stemmer evaluation methods · Urdu stemming

* Sajid Iqbal

sajidiqbal.pk@gmail.com

Abdul Jabbar

a.jabbar73@gmail.com

Shafiq Hussain

shafiqhussain@bzu.edu.pk

Adnan Akhunzada

akhunzadaadnan@gmail.com

1 Department of Computer Science, COMSATS University Islamabad (CUI), Main campus, Park Road, Tarlai Kalan, Islamabad 45550, Pakistan

2 Department of Computer Science, Bahauddin Zakariya University Multan, Multan, Punjab, Pakistan

3 Bahauddin Zakariya University Multan (Sahiwal Sub-Campus), Multan, Punjab, Pakistan

A.Jabbar et al.

1 3

1 Introduction

Performance evaluation is the primary method to find the effectiveness of algorithms and methods developed to solve different scientific problems. Efficient evaluation methods can boost the applicability of the solution. Performance evaluation methods describe and determine the extent to which the solution can achieve its intended goals.

It is an open challenge for computational linguistics researchers to assess the performance of NLP applications (Cambria and White 2014). Existing evaluation methods can be divided into two main categories: intrinsic and extrinsic methods (Gaidhane et al. 2015). In intrinsic evaluation, the output of an NLP application or method is compared with gold standard results that are prepared manually. For instance, a stem produced by a stemmer is compared with the corresponding dictionary stem developed by human experts. In extrinsic evaluation, by contrast, performance is measured directly in a realistic scenario. For example, two stemmers are tested on the same dataset and the accuracies they achieve are compared with each other.
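The intrinsic comparison against a gold standard can be sketched in a few lines of Python. This is only a minimal illustration: the function name `gold_standard_accuracy`, the tiny gold list and the toy stemmer are assumptions made for the example and do not come from the experiments reported in this paper.

```python
def gold_standard_accuracy(stemmer, gold):
    """Fraction of words whose produced stem equals the expert stem.

    stemmer: callable mapping a word to its stem (hypothetical interface).
    gold: dict mapping each test word to the stem chosen by human experts.
    """
    if not gold:
        raise ValueError("gold standard must contain at least one word")
    correct = sum(1 for word, expected in gold.items()
                  if stemmer(word) == expected)
    return correct / len(gold)


# Illustrative gold standard and a deliberately weak stemmer that
# only removes a trailing "s".
gold = {"accepts": "accept", "accepted": "accept",
        "acceptance": "accept", "red": "red"}
remove_s = lambda w: w[:-1] if w.endswith("s") else w

print(gold_standard_accuracy(remove_s, gold))
```

An extrinsic comparison would instead plug each stemmer into the same downstream task and compare the task scores.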

The evaluation of text stemming systems has long been a subject of debate (Brychcín and Konopík 2015). In past studies, researchers have used various text stemming evaluation (TSE) methods. However, the manual assessment of stemmers requires human effort, which makes it a challenging task (Suryani et al. 2018). The state-of-the-art TSE methods can be direct or indirect (Singh and Gupta 2016). Direct evaluation refers to text stemming error analysis, conflation ratio, and compression factors, as mentioned by Abainia et al. (2017). Indirect evaluation refers to stemmer evaluation with respect to a specific NLP application. Such methods usually use machine learning (ML) methods such as K-Nearest Neighbor (KNN), Naïve Bayesian (NB) and Support Vector Machines (SVM) for text classification (Saeed et al. 2018a, b). Presently, neural network based methods are showing better performance than traditional ML methods. Performance evaluation methods used in the Information Retrieval (IR) domain are considered indirect methods (Mustafa and Rashid 2018).

To perform stemming in different languages, various stemmers have been proposed (Jabbar et al. 2018a, b). To evaluate a stemmer from multiple aspects, it must be evaluated through different methods, i.e. in both direct and indirect ways. Direct evaluation methods only consider the conflation ratio for performance measurement (Brychcín and Konopík 2015; Sirsat et al. 2013). These methods measure performance based on the correct stemming ratio. They do not focus on false positives, false negatives and untouched words. This shortcoming limits the applicability of the designed stemmers. In this work, we review state-of-the-art TSE methods. We show experimentally that current evaluation methods provide only a partial picture of stemmer performance.

The contributions of this work are threefold. Firstly, we present an extensive comparative study of existing stemmer evaluation methods to highlight their merits and demerits. Secondly, we perform various experiments to measure the performance of different stemmers designed for the Urdu language. Thirdly, through our experiments, we show that current evaluation methods provide only a partial view and that some of the evaluation methods are language specific.

The remaining part of this paper is organized as follows. Section 2 provides the background of this study. In Sect. 3, different text stemming applications are discussed. Different stemming evaluation methods are described in Sect. 4. A deep analysis of evaluation methods is given in Sect. 5. In Sect. 6, we discuss challenges associated with text stemming evaluation methods and future research directions in this context. Finally, the findings and conclusion are provided in Sect. 7.

2 Background

Stemming is a computational procedure in which affixes are truncated from a conflated word to extract the root. This process may differ from application to application based on its role in that application. In text stemming, stemmers minimize the document index size to improve the efficiency of computation. For example, in the index of a document, English words such as “accepts”, “accepted” and “acceptance” can be mapped to one common root, i.e. “accept”. Text stemming is an integral part of many NLP applications, and the evaluation of these systems is a crucial and tedious task (Dahab et al. 2015). This section presents the definitions and descriptions required to understand the stemming evaluation process.

Morpheme A morpheme is the smallest grammatical unit of a language that cannot be further divided into smaller meaningful parts; morphemes are combined to form meaningful words. This combination can be through inflection, derivation, and composition (Aronoff and Fudeman 2011). For example, the word “acceptance” consists of “accept”, which is a meaningful piece of the word, and “ance”, which carries no meaning when it stands alone. Neither piece can be further divided into smaller meaningful parts.

Affix An affix is a morpheme, consisting of a word or letters, that is attached to the root or stem at any position, i.e. at the end, at the start, on both sides, or anywhere in the middle of the word (Aronoff and Fudeman 2011). Affixes are used to produce inflections and derivative forms of a word. For instance, in “accepts”, “accepting” and “accepted”, the affixes are “s”, “ing” and “ed” and the stem is “accept”.
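The affix examples above can be illustrated with a toy suffix stripper in Python. This is a minimal sketch and not any of the stemmers evaluated in this paper: the suffix list, the function name `strip_suffix` and the minimum stem length are assumptions chosen only to reproduce the “accept” examples.

```python
# Illustrative suffix list; a real stemmer uses a much richer rule set.
SUFFIXES = ("ance", "ing", "ed", "s")


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=3):
    """Remove the longest matching suffix, keeping at least min_stem_len letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word  # no rule applied: the word is returned untouched


for w in ("accepts", "accepting", "accepted", "acceptance", "red"):
    print(w, "->", strip_suffix(w))
```

The minimum stem length guards against stripping “ed” from a short word such as “red”, which relates to the mis-stemming errors discussed in Sect. 2.2.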

Inflectional and derivational affixes Inflectional morphemes modify words and change grammatical categories such as tense, number (singular, plural) and gender (masculine, feminine, neuter). Derivation, in contrast to inflection, constructs new words by adding affixes to a root word (Qureshi et al. 2018).

Stem A stem is a base morpheme to which other morphemes, such as affixes, are attached (Aronoff and Fudeman 2011).

Root A root is like a stem, but the term refers only to morphologically simple units. For instance, ‘disagree’ is the stem of ‘disagreement’ because it is the base morpheme to which the affix ‘ment’ is attached, but ‘agree’ is the root (Aronoff and Fudeman 2011). Despite this distinction, we use “stem” and “root” interchangeably in this paper. Where required, the context will make the intended meaning clear.

Lemma and Lexeme A lexeme indicates the common morpheme shared by the variant forms of a word. A lemma, on the other hand, is the particular form that is chosen from this collection of forms to represent the lexeme. The lemma is a valid dictionary word (Singh and Gupta 2016). For instance, the words ‘write’, ‘writing’, ‘wrote’, ‘writes’ and ‘written’ belong to one lexeme, and ‘write’ is its lemma.

2.1 Overview ofstemming algorithms

In the literature, researchers have categorized stemming algorithms in different ways. Al-Sughaiyer and Al-Kharashi (2004) studied stemmers for the English and Arabic languages and classified the English stemmers into three main categories, i.e. table lookup, linguistic, and computational. They classified Arabic stemmers into four groups, i.e. table lookup, linguistic, computational and pattern based. According to Paik et al. (2011), stemmers can be categorized as rule-based and statistical stemmers. Jivani (2011) examined the English stemmers and classified them into three subcategories, i.e. truncating, statistical and mixed. Zhou et al. (2012) divided the stemmers into two main categories (i.e., rule-based stemmers and statistical stemmers). Moghadam and MohammadReza (2015) described the Persian stemmers and classified them into three classes: structural stemmers, table lookup stemmers and statistical stemmers. Singh and Gupta (2017) divided the stemming approaches into linguistic rule-based methods and language independent/statistical methods. Jabbar et al. (2018a, b) classified the stemmers into three classes: (a) linguistic-based stemmers, (b) corpus-based stemmers, and (c) hybrid stemmers. In general, stemming algorithms can be classified as linguistic-based and computational-based stemmers, as shown in Fig. 1. In linguistic-based stemmers, handcrafted grammatical rules are designed to derive the stem. For instance, Saeed et al. (2018a, b) presented a rule-based stemmer for the Kurdish language, and Suryani et al. (2018) developed a Sundanese rule-based stemmer. A computational stemmer, on the other hand, performs some statistical or non-statistical computations. Corpus-based stemming measures the co-occurrence of variant word forms [e.g., Alotaibi and Gupta (2018)]. Similarly, in statistical stemmers, researchers have applied statistical and machine learning based procedures/techniques to extract the stem (Bölücü and Can 2019; Pande et al. 2018).

A good number of rule-based stemmers for the English language have been proposed in the literature (Lovins 1968; Porter 1980). Many researchers have attempted to improve the performance of Martin Porter’s stemmer (Bimba et al. 2016), due to its effectiveness for many NLP applications (Chintala and Reddy 2013; Patil and Patil 2013). Examples of rules and resource development for rule-based stemmers are given by Jabbar et al. (2016), where the authors developed some resources for Urdu language processing. In the table lookup/reference lookup approach (also called the brute force approach), each word and its corresponding stem are saved in the form of a table (Hussain et al. 2017). This approach is suitable for languages that have very complex linguistic structures. In statistical stemmers, several statistical features are extracted from the given dataset. Using these features, the stem of the query word is obtained with statistical classifiers. Statistical methods are also known as language independent methods. Hybrid stemmers are constructed by combining two or more stemming methods. For instance, the stemmer of Bimba et al. (2016) combined the rule-based and table lookup approaches for the Hausa language. Jabbar et al. (2018a, b) also constructed an Urdu stemmer using a hybrid approach.

Fig. 1 Classification of stemming algorithms in terms of types. The figure shows the types of stemmers split into linguistic-based and computational-based stemmers, with the categories rule based/affix stripping stemmers, template based, table lookup, corpus based and statistical.
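A hybrid of the table lookup and rule-based approaches described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the design of any cited stemmer: the function name `hybrid_stem`, the lookup entries and the suffix rules are hypothetical.

```python
def hybrid_stem(word, lookup_table, suffixes, min_stem_len=3):
    """Try the word-to-stem table first, then fall back to suffix stripping."""
    # Table lookup step: irregular forms are stored explicitly.
    if word in lookup_table:
        return lookup_table[word]
    # Rule-based step: strip the longest matching suffix.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word


# Hypothetical resources for the example.
lookup = {"wrote": "write", "written": "write"}
rules = ("ing", "ed", "s")

for w in ("wrote", "writes", "accepting", "red"):
    print(w, "->", hybrid_stem(w, lookup, rules))
```

The table handles forms that no suffix rule can reach (such as “wrote”), while the rules cover regular forms without storing every word, which is the trade-off that motivates combining the two methods.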

2.2 Text stemming errors

Recognizing the types of errors that a stemmer may produce is the first step in measuring the effectiveness of a given stemmer. These error types help to answer questions such as when and why the errors occur and what their effect on stemmer performance is. Fig. 2 gives the categorization of stemming errors.

Under stemming errors (USE) This refers to the case in which a stemmer strips fewer letters than the acceptable level. In this type of error, the stemmer either returns the word as it is (no stem) or the affix removal process generates a word with a changed meaning, as shown in Table 1.

Over stemming errors (OSE) OSE is an error in which a stemmer truncates more characters than required. OSE leads to an invalid stem or an out-of-vocabulary (OOV) word (Table 2).

Mis-stemming errors (MSE) The term “mis-stemming errors” refers to those errors in which the stripped characters do not form a proper affix (Table 3).

Fig. 2 Overview of stemming errors. Text stemming errors are divided into under stemming (no stem, under stem), over stemming (invalid stem, over stem) and mis-stemming (invalid word, change word).

Table 1 Examples of under stemming errors

| Input word | Actual stem | Produced stem | Types of error |
|---|---|---|---|
| Acceptance | Accept | Acceptance | No stem |
| Acceptances | Accept | Acceptance | Under stem |

Table 2 Example of over stemming errors

| Input word | Actual stem | Produced stem | Types of error |
|---|---|---|---|
| receiving | receive | receiv | Invalid stem |
| consistently | consist | consistent | Over stem |

Table 3 Example of mis-stemming errors

| Input word | Actual stem | Produced stem | Types of error |
|---|---|---|---|
| Red | Red | r | Invalid word |
| kneel | Kneel | knee | Change word |

A.Jabbar et al.

1 3

3 Applications ofstemming algorithms

Stemming is a form of morphological analysis and a necessary preprocessing step for NLP applications (Hassani and Lee 2016). To model a language, researchers extract different types of features (manual or automatic) from the given data (Brychcín and Konopík 2015). There is a variety of NLP applications, and each type of application may require different types of features. For example, a language expert may utilize a stemmer for vocabulary learning and development (Mochizuki and Aizawa 2000). Sarma and Purkayastha (2013) and Dang et al. (2013) have used stemming for word classification and WordNet development, respectively. Domain-specific word extraction is performed by Rehman et al. (2013) and Nguyen and Leveling (2013). The vocabulary mismatch problem can also be solved with the help of stemming (Singh and Gupta 2016). Applications such as Information Extraction (IE), Information Retrieval (IR), Text Classification (TC), Text Clustering (TClu), Question Answering (QA), Text Summarization (TS), Machine Translation (MT), Text Segmentation (TS), Indexing (Ind), Automatic Speech Recognition (ASR) (Dahab et al. 2015; Singh and Gupta 2016) and language generation (Mishra and Prakash 2012) require stemming as a preprocessing step. In short, stemming improves performance by reducing the time and space complexity of several NLP applications (Boudchiche and Mazroui 2015). A summarized view of the applications of text stemming systems is provided in Table 4.

4 Stemming evaluation methods

In this section we review the representative stemming evaluation methods. To the best of our knowledge, we have included all available stemming evaluation methods; they are summarized in Table 5.

Text stemming evaluation methods can be categorized as direct, indirect and gold standard evaluation metrics, as shown in Fig. 3.

4.1 Direct evaluation methods

These evaluation methods use training datasets to extract the required statistics, known as features, for assessing stemming. These statistics may include the over stemming index (OI), the under stemming index (UI), the Index Compression Factor (ICF), the Average Words Conflation Factor (AWCF), and a few more statistical significance metrics. In this subsection, we analyze these direct evaluation methods and discuss our analysis.
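As an illustration of one such statistic, the sketch below computes an index compression factor. ICF is only named in this subsection, so the definition used here is an assumption: the commonly cited form from Frakes and Fox (2003), ICF = (n − s)/n, where n is the number of distinct words before stemming and s is the number of distinct stems after stemming. The function name and the toy stemmer are hypothetical.

```python
def index_compression_factor(words, stemmer):
    """ICF = (n - s) / n, with n distinct words and s distinct stems (assumed definition)."""
    unique_words = set(words)
    if not unique_words:
        raise ValueError("need at least one word")
    unique_stems = {stemmer(w) for w in unique_words}
    return (len(unique_words) - len(unique_stems)) / len(unique_words)


# Toy stemmer backed by a fixed mapping, for illustration only.
toy_stems = {"accepts": "accept", "accepted": "accept",
             "accepting": "accept", "red": "red"}
toy_stemmer = lambda w: toy_stems.get(w, w)

print(index_compression_factor(toy_stems.keys(), toy_stemmer))
```

Note that a value like this only reports how strongly the vocabulary shrinks; it does not check whether the stems produced are correct.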

4.1.1 Paice’s (1994, 1996) evaluation methodology

Paice (1994, 1996) proposed the first stemmer evaluation method based on error counting and predefined groups of words that are related to each other either morphologically or semantically. A good stemmer conflates as many words as possible within a predefined group and avoids conflating words from different classes or words that are semantically distinct. Using this method, the performance of a stemmer is measured with the under-stemming index (UI) and over-stemming index (OI) parameters, their ratio and the stemming weight (SW). To determine

Empirical evaluation andstudy oftext stemming algorithms

1 3

Table 4 State-of-the-art applications of text stemming algorithms

No. 1 Cited Application Description

1 Schoﬁeld and Mimno. (2016), Hassani and Lee (2016) Language modeling Stemming may be viewed as a system of smoothing and as a

way of better statistical estimation

2 Schoﬁeld and Mimno (2016) Topic modeling Stemmers can reduce the vocabulary size and topic mod-

eling depends upon sparse vocabulary. So, it is also lever-

aged in topic modeling as a preprocessing step

3 Boukhalfa etal. (2018) Plagiarism detection Stemming enhances the performance of the similarity detect-

ing system

4 McCormick (2016) Word embedding Two words have similar context must have the same word

vector such as “ant” and “ants” have a similar context that

is possible when a stemmer stem “ants” to “ant”

5 Sarma and Purkayastha (2013) Word classiﬁcation Stemming also improves the eﬃciency of word classiﬁcation

applications

6 Dang etal. (2013) WordNet development Stemming deals with a variant form of words and each form

of word belongs to a speciﬁc part of speech that is helpful

in WordNet development

7 Nguyen and Leveling (2013) Domain-speciﬁc words extract from the text Domain-speciﬁc words have a speciﬁc form or speciﬁc aﬃx

and stemming facilitates to identify these words form or

aﬃx

8 Saeed etal. (2018a, b)

Ismailov etal. (2016)

Karimi etal. (2015)

Text mining The goal of text mining is to extract meaningful information

from text data

9 Dey etal. (2014) Named entity recognition (NER) Named entity recognition (NER) system seeks and extract

the predeﬁned proper and common nouns entities from the

natural language text

10 Rehman etal. (2013) Word segmentation Stemming can be viewed as word tokenization from the

continuous text

11 Dahab etal. (2015), Singh and Gupta (2016) Information extraction (IE) Stemming also improves the eﬃciency of an information

extraction system

12 Flores and Moreira (2016), de Oliveira and Junior (2018) Information retrieval (IR) IR system uses a variant form of the words via stemmer

A.Jabbar et al.

1 3

Table 4 (continued)

No. 1 Cited Application Description

13 Rani etal. (2015), Ali etal. (2018), Saeed etal. (2018a, b) Text classiﬁcation (TC) Stemming can be viewed as text classiﬁcation mechanism

because it groups the words that share the same morpho-

logical root

14 Khalid etal. (2016), Dahab etal. (2015) Text clustering (TClu) A bag of words can be grouped by the stemming system

15 Giachanou and Crestani. (2016), Yadollahi etal. (2017) Sentiment analysis Pre-processing includes recognition and deletion of stop

words, slangs, abbreviations, stemming and correction

16 Khalid etal. (2016) text compression Text stemming compresses the vocabulary by reducing

conﬂicted words to their common root form

17 Dahab etal. (2015), Singh and Gupta (2016) Question answering (QA) A variant form of questioning words stems that enhances the

performance of the QA system

18 Dahab etal. (2015), Singh and Gupta (2016) Text summarization (TS) Variant forms of words with diﬀerent meanings can be

reduced to their common root. It makes easy for TS system

to perform better

19 Fattah etal. (2006) Machine translation (MT) It is the variant form of words which are not present in the

tagset language. Subsequently, in such cases, stemmer

provides the stem that is helpful for translation

20 Rashid and Mohamad (2016) Detecting wicked website In wicked information ﬁltering and detecting wicked web-

site, text stemming is used to extract features that ultimatly

improve the performance of the system

Empirical evaluation andstudy oftext stemming algorithms

1 3

Table 5 Summary of state-of-the-art stemmer’s evaluation metrics

| No | Work | Stemming method | Languages | Dataset size | Evaluation methods |
|---|---|---|---|---|---|
| 1 | Mishra and Prakash (2012) | Rule-based | Hindi | 2265 words | Manual |
| 2 | Ababneh et al. (2012) | Rule-based | Arabic | Sample terms list | Manual |
| 3 | Karaa (2013) | Rule-based | English | 30,000 words | Paice (1994)’s evaluation method |
| 4 | Thangarasu and Manavalan (2013) | Statistics (cluster analysis) | Tamil | 7,000 words | Manual |
| 5 | Husain et al. (2013) | Statistics (N-gram) | Urdu and Marathi | 1,200 Urdu words; 1,200 Marathi words | Manual |
| 6 | Sulaiman et al. (2014) | Rule-based | Malay | 1,200 words | Paice (1994)’s evaluation method |
| 7 | Abu-Errub et al. (2014) | Rule-based | Arabic | 1100 words | Manual |
| 8 | Al-Omari and Abuata (2014) | Rule-based (linguistic and mathematics rules) | Arabic | 6,225 words | Manual |
| 9 | Rashidi and Lighvan (2014) | Hybrid | Persian | Small data from Hamshahri collection | Manual |
| 10 | Dianati et al. (2014) | Corpus base approach | Persian | 1,250 Persian words | Manual |
| 11 | Al-Kabi et al. (2015) | Rule-based and pattern base | Arabic | 6,081 words | Manual |
| 12 | Khan et al. (2015) | Rule-based and template matching | Urdu | 66,200 words | Precision, recall and F-measure |
| 13 | Brychcín and Konopík (2015) | Statistical | Czech, Slovak, Polish, Hungarian, Spanish and English | Large data set for each language | Precision, recall and F-measure |
| 14 | Bimba et al. (2016) | Rule-based | Hausa | 1,723 words | Paice (1994)’s evaluation method; Sirsat’s evaluation method (Sirsat et al. 2013) |
| 15 | Momenipour and Keyvanpour (2016) | Statistical | Persian | PER-Tree-Bank words; Bijankhan distinct words; Hamshahri test collection | Manual, precision, recall |
| 16 | El-Defrawy et al. (2016) | Rule-based | Arabic | International Corpus of Arabic (ICA) | Precision, recall and F-measure |
| 17 | Abainia et al. (2017) | Rule-based | Arabic | ARASTEM data set (a) | Paice (1994)’s evaluation methodology |
| 18 | Taghi-Zadeh et al. (2015) | Hybrid | Persian | 4,689 words; 26,913 words | Manual |
| 19 | Singh and Gupta (2017) | Statistical | English, Marathi, Hungarian, Bengali | 173,252 WSJ documents (English); 99,275 documents (Marathi); Magyar Hirlap collection of 49,530 documents (Hungarian); FIRE 2010 collection containing 123,047 documents (Bengali) | Precision, recall and F-measure |
| 20 | Mateen et al. (2017) | Hybrid | Punjabi | 85,152 words | Manual |
| 21 | Jaafar et al. (2017) | Rule-based | Arabic | Quranic Arabic Corpus | Frakes and Fox (2003) evaluation mechanism |
| 22 | Jabbar et al. (2018a, b) | Hybrid | Urdu | 76,074 words | Precision, recall and F-measure; Frakes and Fox (2003) |
| 23 | Alotaibi and Gupta (2018) | Statistical | English, Marathi, Hungarian, Bengali | WSJ documents (English); Sakal and Maharashtra Times (Marathi); Magyar Hirlap corpus (Hungarian); FIRE 2010 collection (Bengali) | Precision, recall and F-measure |
| 24 | Suryani et al. (2018) | Rule-based | Sundanese | 4,453 words | Paice (1994)’s evaluation methodology |
| 25 | Saeed et al. (2018a, b) | Rule-based | Arabic | 4007 documents | Precision, recall and F-measure |
| 26 | Ali et al. (2019) | Rule-based | Urdu | 32,000 words | Manual |
| 27 | Bölücü and Can (2019) | Statistical | Turkish, Hungarian, Finnish, Basque, and English | Turkish = 5,620 sentences and 53,798 tokens; Hungarian = 24K words; Finnish = 19,000 sentences and 1,62,000 words; Basque = 24K words; English = 24K words | Frakes and Fox (2003) evaluation mechanism |

(a) https://abainia.net

these values, a list of groups of semantically and morphologically related words is formed and then submitted to the stemmer. A stemmer commits an under-stemming error if it produces more than one unique stem for the same group or class of words. On the other hand, if the stem produced for one group also occurs in another group (the same stem is produced for two groups of words), then the stemmer has committed an over-stemming error. An ideal stemmer conflates each group to a single stem and has low UI and OI values. UI and OI can be calculated using Eqs. (1) and (2). The following four parameters are used to calculate the over-stemming and under-stemming indexes:

Global Desired Merge Total (GDMT)

Global Desired Non-Merge Total (GDNT)

Global Unachieved Merge Total (GUMT)

Global Wrongly Merged Total (GWMT)

These parameters are defined below.

Fig. 3 Classification of text stemming evaluation methods: direct methods (examples: Paice evaluation method, Sirsat’s evaluation method), indirect methods (examples: evaluation of a stemmer using IR systems with precision, recall and F-measure) and gold standard (manual methods).

Under stemming index (UI) The Desired Merge Total (DMT) represents the total number of word forms in the group, and it can be calculated by Eq. (3):

$$DMT_g = n_g (n_g - 1) \tag{3}$$

where $n_g$ = possible number of morphological forms in a particular group having the same stem.

GDMT is equal to the sum of the DMT values for all word groups, as in Eq. (4):

$$GDMT = \sum_{i=1}^{n} DMT_i \tag{4}$$

Unachieved Merge Total (UMT) represents the failure of a stemmer to merge all query words in a specific group to the same root. It is computed from the distinct stems that the stemmer produces in a group, and it can be calculated by Eq. (5):

$$UMT_g = \sum_{i=1}^{s} u_i (n_g - u_i) \tag{5}$$

where $s$ = number of distinct stems produced by the stemmer (a stemmer may produce multiple stems for a particular group of words), $n_g$ = possible number of morphological forms in a particular group having the same stem, and $u_i$ = number of words in the group that are reduced to the $i$th stem produced by the stemmer.

GUMT can be calculated as the sum of the UMT values of all groups using Eq. (6):

$$GUMT = \sum_{i=1}^{n} UMT_i \tag{6}$$

where $n$ = total number of groups in the given corpus.

Finally, the under-stemming index (UI) is defined as given in Eq. (1):

$$\text{Under Stemming Index (UI)} = \frac{GUMT}{GDMT} \tag{1}$$

Over stemming index (OI) Wrongly Merged Total (WMT) represents over stemming, in which words from two different groups are stemmed to one root. It can be calculated using Eq. (7):

$$WMT = \sum_{i,j=1}^{N} v_{ij} (n_i - v_{ij}) \tag{7}$$

where $N$ = total number of groups involved in correct and wrong stemming, $n_i$ = number of words in the $i$th group, and $v_{ij}$ = number of stems that should actually belong to group $i$ but are produced in group $j$. For three groups, for instance, the terms are $v_{11}, v_{12}, v_{13}, v_{21}, v_{22}, v_{23}, v_{31}, v_{32}, v_{33}$.

In Eq. (7), $v_{ij}$ is the number of stems that belong to group $i$ and that the stemmer has produced in group $j$. If $i = j$, the stemmer has performed the job correctly, whereas in the case $i \neq j$, wrong stemming has been performed. In short, it tells whether the stemming process for a particular group of morphological variants is interfering with other groups.

Global Wrongly Merged Total (GWMT) is obtained by summing the WMT values for all the groups by Eq. (8):

$$GWMT = \sum_{i=1}^{n} WMT_i \tag{8}$$

Desired Non-Merge Total (DNT) refers to the number of words in a certain group that can be conflated after stemming with words from some other semantic group. DNT can be calculated by Eq. (9):

$$DNT_g = (W - n_g) \tag{9}$$

where $W$ = total number of words in the test dataset and $n_g$ = total number of words in the particular group.

The Global Desired Non-Merge Total (GDNT) is equal to the sum of the DNT values for all the groups and can be calculated by Eq. (10):

$$GDNT = \sum_{i=1}^{n} DNT_i \tag{10}$$

Hence, the over-stemming index (OI) can be calculated by Eq. (2):

$$\text{Over Stemming Index (OI)} = \frac{GWMT}{GDNT} \tag{2}$$

Stemming weight (SW) The SW parameter measures the strength of a stemmer. A lower value of SW identifies weak stemming, whereas a higher value indicates strong stemming. It is the ratio of the two indexes and can be calculated using Eq. (11):

$$SW = \frac{OI}{UI} \tag{11}$$
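The idea behind UI, OI and SW can be sketched in Python by counting word pairs directly. This is a minimal sketch under stated assumptions rather than an implementation of Eqs. (3) to (10): it counts unordered pairs of words, treating a same-group pair with different stems as an unachieved merge and a cross-group pair with the same stem as a wrong merge. The function name `paice_indexes`, the word groups and the toy stemmers are hypothetical, and the words are assumed to be distinct across groups.

```python
from itertools import combinations


def paice_indexes(groups, stemmer):
    """Return (UI, OI, SW) for word groups and a stemmer, by pair counting.

    groups: list of lists of distinct words; each inner list is one
            predefined group of related words.
    stemmer: callable mapping a word to its stem.
    """
    tagged = [(word, gid) for gid, group in enumerate(groups) for word in group]
    stems = {word: stemmer(word) for word, _ in tagged}

    desired_merge = unachieved_merge = 0      # pairs inside one group
    desired_non_merge = wrongly_merged = 0    # pairs across two groups
    for (w1, g1), (w2, g2) in combinations(tagged, 2):
        same_stem = stems[w1] == stems[w2]
        if g1 == g2:
            desired_merge += 1
            unachieved_merge += not same_stem
        else:
            desired_non_merge += 1
            wrongly_merged += same_stem

    ui = unachieved_merge / desired_merge if desired_merge else 0.0
    oi = wrongly_merged / desired_non_merge if desired_non_merge else 0.0
    sw = oi / ui if ui else None  # SW is undefined when UI is zero
    return ui, oi, sw


groups = [["accept", "accepts", "accepted"], ["access", "accesses"]]
identity = lambda w: w          # never merges anything (weakest stemmer)
first_four = lambda w: w[:4]    # merges everything to "acce" (too strong)

print(paice_indexes(groups, identity))
print(paice_indexes(groups, first_four))
```

The identity stemmer leaves every within-group pair unmerged, while the four-letter truncation merges the two groups completely, so the two calls show the extremes between which a practical stemmer falls.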

Paice (1994) utilized the CISI source (CISI Collection, University of Glasgow), which contains 184,659 words, and extracted two smaller word samples of size 1,527 distinct words. The author experimented with the Lovins (1968), Porter (1980) and Paice/Husk (Paice 1990) stemmers and concluded that the Paice/Husk stemmer has the highest OI value, whereas the other two stemmers show lower scores. In the case of the UI index, Porter’s stemmer has more under-stemming errors than the others. Paice/Husk (Paice 1990) has the lowest UI score [$1.21 \times 10^{-1}$] and Porter has the highest UI [$3.74 \times 10^{-1}$], whereas Porter has the lowest OI [$2.18 \times 10^{-5}$] and Paice/Husk (Paice 1990) has the highest OI [$1.18 \times 10^{-4}$], as mentioned in Table 6. According to its SW score [$9.78 \times 10^{-4}$], Paice/Husk (Paice 1990) is the strongest stemmer; the Lovins (1968) stemmer, with an SW of [$1.93 \times 10^{-4}$], is in second place, and the Porter (1980) stemmer, with an SW of [$7.4 \times 10^{-5}$], stands last, as shown in Table 6.

Discussion Paice’s Evaluation Methodology (PEM) has some problems. First, it is not trivial to create groups of morphologically related words, and if a group contains only one word then the value of DMT will be zero. Moreover, it is time-consuming because a manual check is needed to find whether a resultant word suffers from under-stemming or over-stemming. This methodology is therefore not suitable for checking a large data set. It deals with only two types of stemming errors; however, a stemmer may commit other errors, such as the generation of invalid words, i.e. the generated stem does not lie in any group. In some cases, the stemmer produces a stem that is linguistically correct but incorrect in reality. For example, for the word meaning [two boys], the Khoja (1999) stemmer derives the root meaning [soft]; it is linguistically correct but not the valid stem, since the root meaning [give birth] is the correct stem (Nwesri and Alyagoubi 2015). Finally, this method is suitable only for the English language (AlSerhan and Alqrainy 2008).

4.1.2 Hull’s evaluation method

When the performance of two stemmers differs only slightly, it is hard to say whether the difference is meaningful or has merely occurred by chance. For such cases, Hull (1996) proposed an Analysis of Variance (ANOVA) model, which assumes a large, continuous and normally distributed sample. English has few inflectional word forms, so the observed differences are limited. Hull (1996) performed experiments with five stemmers [Remove s stemmer,¹ Lovins (1968),² Porter (1980), and the Inflec and Deriv stemmers (Xer 1994)] using the SMART text retrieval system, which originated at Cornell University (Buckley 1985). A run without stemming, in which the queries are indexed and compared in their original form, was also included. Hull (1996) concluded that the average absolute improvement in an IR system due to stemming alone is small (1–3%).

¹ Built in SMART system.

² Extensively modified version of Lovins (1968) included in SMART system.

Table 6 Results of Paice evaluation methods

| Stemmer | UI | OI | SW |
|---|---|---|---|
| Lovins | $3.26 \times 10^{-1}$ | $16.3 \times 10^{-5}$ | $1.93 \times 10^{-4}$ |
| Paice/Husk | $1.21 \times 10^{-1}$ | $1.18 \times 10^{-4}$ | $9.78 \times 10^{-4}$ |
| Porter | $3.74 \times 10^{-1}$ | $2.18 \times 10^{-5}$ | $7.4 \times 10^{-5}$ |

4.1.3 Frakes’s evaluation

The strength of a stemmer indicates the degree of variation among the word forms that it reduces to a single stem. Weak/light stemmers conflate only highly related words such as “consist”, “consisted”, and “consisting”. In contrast, strong/heavy stemmers also handle more distant morphological forms such as “consistency”, “consistent”, and “consistently”. Frakes and Fox (2003) proposed criteria for determining stemmer strength and similarity, as described below.

The mean number of words per conflation class (MWC) MWC refers to the mean number of words conflated per class. For example, if “consist”, “consisted”, and “consisting” are conflated to “consist”, the class contains three words. A higher value of MWC signifies better performance of a stemmer. It is calculated using Eq. (12). In Eq. (12), N refers to the total number of unique words before stemming and S refers to the number of unique stems obtained.

Index compression factor (ICF) A higher ICF value signifies a stronger stemmer (Frakes and Fox 2003). Several experiments have shown that strong stemmers yield higher ICF values. Lennon et al. (1981) obtained an ICF of 30.9–45.8% with the Lovins (1968) stemmer and 26.2–38.8% with the Porter (1980) stemmer. Frakes and Fox (2003) reported 29% for the Lovins stemmer and 17% for the Porter (1980) stemmer. Paice (1994) and Harman (1991) likewise found that the Lovins (1968) stemmer has a higher ICF (44.60% and 38.38%, respectively) than the Porter (1980) stemmer, as shown in the last two columns of Table 7.

The index compression factor is defined in Eq. (13). In Eq. (13), $n$ is the number of words in the corpus and $s$ is the number of stems.

For example, a corpus with 50,000 words ($n$) and 40,000 stems ($s$) would have an index compression factor of 20%.
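The ICF computation can be illustrated with a minimal Python sketch. It assumes ICF is reported as a percentage, $(n - s)/n \times 100$; the names `index_compression_factor` and `toy_stem` are hypothetical, and `toy_stem` is only an illustrative suffix stripper, not any of the stemmers cited above.

```python
def index_compression_factor(n_words: int, n_stems: int) -> float:
    """Return ICF = (n - s) / n as a percentage."""
    if n_words <= 0:
        raise ValueError("the corpus must contain at least one word")
    return 100.0 * (n_words - n_stems) / n_words


def toy_stem(word: str) -> str:
    """Hypothetical stemmer: strips a few English suffixes (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Worked example from the text: 50,000 words and 40,000 stems.
print(index_compression_factor(50_000, 40_000))  # 20.0

# ICF of a small vocabulary under the toy stemmer.
vocabulary = {"consist", "consisted", "consisting", "help", "helps", "helped"}
stems = {toy_stem(word) for word in vocabulary}
print(index_compression_factor(len(vocabulary), len(stems)))
```

In the second call the six words collapse to two stems, so the vocabulary shrinks by two thirds; a heavier stemmer would push the value higher without saying anything about whether the stems are correct.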

The word change factor (WCF) WCF indicates the proportion of words that are changed by the stemmer. For example, a stemmer might not alter the word “consist”, as it is already a stem. Strong stemmers change words more often than weaker stemmers. Normally, a higher value of WCF is taken to indicate a better stemmer. WCF can be calculated by Eq. (14).

$$\mathrm{MWC} = \frac{N}{S} \tag{12}$$

$$\mathrm{ICF} = \frac{n - s}{n} \tag{13}$$

Table 7 ICF values of Lovins and Porter stemmers

| Stemmer | Lennon et al. (1981) | Frakes and Fox (2003) | Paice (1994) | Harman (1991) |
|---|---|---|---|---|
| Lovins (1968) | 30.9–45.8% | 29% | 44.60% | 38.23% |
| Porter (1980) | 26.2–38.8% | 17% | 38.90% | 28.74% |

A.Jabbar et al.

1 3

In Eq. (14), $N$ is the number of unique words and $C$ is the number of unchanged words.

The mean number of characters removed (MCR) MCR represents the average number of characters removed per word (in a group) to derive the stem. A strong stemmer truncates more letters than a weak stemmer to obtain the stem. For example, to extract the stem “help”, one character (‘s’) is removed from “helps”, two (‘er’) from “helper”, three (‘ful’) from “helpful” and four (‘less’) from “helpless”. Eq. (15) computes the MCR score.
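The three word-level strength measures (MWC, WCF and MCR) can be sketched together in Python for the “help” family used in the text. The function name `frakes_strength` is hypothetical, and MCR is approximated here as the length difference between word and stem, which is an assumption that only holds for purely truncating stemmers.

```python
def frakes_strength(word_to_stem: dict[str, str]) -> dict[str, float]:
    """Compute MWC, WCF and MCR for a mapping {word: stem}."""
    n_words = len(word_to_stem)
    n_stems = len(set(word_to_stem.values()))
    # Words that the stemmer returned unchanged.
    unchanged = sum(1 for word, stem in word_to_stem.items() if word == stem)
    # Characters removed, approximated by the length difference (assumption).
    removed = sum(max(len(word) - len(stem), 0)
                  for word, stem in word_to_stem.items())
    return {
        "MWC": n_words / n_stems,                # mean words per conflation class
        "WCF": (n_words - unchanged) / n_words,  # proportion of words changed
        "MCR": removed / n_words,                # mean characters removed
    }


help_family = {
    "helps": "help",
    "helper": "help",
    "helpful": "help",
    "helpless": "help",
}
metrics = frakes_strength(help_family)
print(metrics)  # {'MWC': 4.0, 'WCF': 1.0, 'MCR': 2.5}
```

The MCR of 2.5 is the $(1+2+3+4)/4$ example from the text. All three values would be unchanged if every word were mapped to a wrong stem of the same length, which is the limitation discussed below.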

Frakes and Fox (2003) used the Moby Common Dictionary wordlist³ to evaluate four stemmers ["S" stemmer (Harman 1991), Lovins (1968), Porter (1980), Paice/Husk (1990)] and claimed that the Paice and Lovins stemming algorithms are the most similar, while the Paice and "S" stemmers are the most dissimilar.

Frakes’s metrics only measure the strength of a stemmer and provide a way to check the similarity of two stemmers; they do not deal with the accuracy of the extracted stem. They do not measure how many words are stemmed correctly, nor do they provide information about the production of invalid or modified stems.

MCR does not measure transformations of the stem, and the compression index (CI) only checks the compression ratio of the vocabulary size, not the correctness of the stem. Moreover, it is difficult to identify all the conflation classes, and manually checking the corresponding stems for every conflation class is a tedious task. Some languages, such as Hungarian or Hebrew, have rich inflectional and derivational morphology, with thousands of variant forms of a single word (Krovetz 2000); consequently, their vocabulary compression index will be high. For instance, Porter (1980) claimed that his well-known English stemmer reduces the initial vocabulary by one third, and Jabbar et al. (2018a, b) proposed an Urdu stemmer that reduced the vocabulary size by 55%.

4.1.4 Sirsat’s evaluation method

Sirsat etal. (2013) criterion is very compelling for assessing the strength and accuracy of a

stemming algorithm. The following parameters are used to evaluate the strength and accu-

racy of the stemmer.

Word stemmed factor (WSF) It refers to the average number of words stemmed from the

stemmer. The threshold value is the minimum (50%). The larger value of WSF signiﬁes the

strength of the stemmer. It can be calculated by Eq.(16)

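$$\mathrm{WCF} = \frac{N - C}{N} \tag{14}$$

$$\mathrm{MCR} = \frac{\text{Total no. of letters removed}}{\text{Total no. of words}}, \qquad \mathrm{MCR} = \frac{1+2+3+4}{4} = 2.5 \tag{15}$$

$$\mathrm{WSF} = \frac{WS}{TW} \times 100 \tag{16}$$

³ https://antiflux.org/dictionary?dict=moby-thesaurus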

Empirical evaluation andstudy oftext stemming algorithms

1 3

In Eq. (16), $WS$ is the number of stemmed words and $TW$ is the total number of words in a sample.

Correctly stemmed words factor (CSWF) CSWF indicates the percentage of stemmed words that are stemmed correctly. A higher percentage of CSWF indicates higher strength and accuracy of the stemmer. The minimum threshold value of CSWF is 50%, and it can be calculated by Eq. (17), where $CSW$ is the number of correctly stemmed words and $WS$ is the total number of stemmed words.

Average words conflation factor (AWCF) This refers to the mean value of variant words of different conflation groups that are correctly stemmed. To calculate AWCF, we must first compute the number of distinct words after conflation (NWC), which is calculated by Eq. (18), where $S$ is the number of distinct stems after stemming and $CW$ is the number of correct words which are not stemmed.

Finally, AWCF is obtained by Eq. (19).

A higher value of AWCF indicates higher strength and accuracy of the stemmer.
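The following Python sketch combines WSF, CSWF, NWC and AWCF in one helper, assuming the definitions above ($\mathrm{WSF} = WS/TW \times 100$, $\mathrm{CSWF} = CSW/WS \times 100$, $\mathrm{NWC} = S - CW$, $\mathrm{AWCF} = (CSW - NWC)/CSW \times 100$). The function name `sirsat_metrics` and both sets of counts are hypothetical and chosen only to show how AWCF changes sign.

```python
def sirsat_metrics(total_words: int, stemmed_words: int, correctly_stemmed: int,
                   distinct_stems: int, correct_unstemmed: int) -> dict[str, float]:
    """Return WSF, CSWF, NWC and AWCF from raw counts (percentages except NWC)."""
    wsf = 100.0 * stemmed_words / total_words
    cswf = 100.0 * correctly_stemmed / stemmed_words
    nwc = distinct_stems - correct_unstemmed
    awcf = 100.0 * (correctly_stemmed - nwc) / correctly_stemmed
    return {"WSF": wsf, "CSWF": cswf, "NWC": nwc, "AWCF": awcf}


# Hypothetical stemmer A: few correct stems, many distinct stems.
stemmer_a = sirsat_metrics(total_words=100, stemmed_words=80,
                           correctly_stemmed=40, distinct_stems=55,
                           correct_unstemmed=5)
print(stemmer_a)  # AWCF is negative because NWC (50) exceeds CSW (40)

# Hypothetical stemmer B: more correct stems, fewer distinct stems.
stemmer_b = sirsat_metrics(total_words=100, stemmed_words=80,
                           correctly_stemmed=60, distinct_stems=30,
                           correct_unstemmed=5)
print(stemmer_b)
```

Both hypothetical stemmers have the same WSF, so WSF alone cannot separate them; the difference only shows up in CSWF and AWCF, which require a manually verified set of correct stems.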

Sirsat etal. (2013) carried out the experiments over four stemmers [lovins (1968), porter

1(1980), porter 2 (2006), Paice/Husk(1990)] and concluded that the Paice/Husk stemmer

is slightly better than other stemmers in terms of ICF [64.63] and AWCF [19.26]. Lovins

(1968)’s stemmer has higher WSF 73.35 and porter 2 better with CSWF 34.76 as shown in

Table8.

The value of AWCF may be zero or negative when the number of incorrect stems is

larger than the correctly stemmed words.

4.1.5 Jaafar évaluations mechanism

Jaafar etal. (2017) used the execution time and accuracy of stemmer to determine the per-

formance of a stemmer using formula given in Eq.(20)

where,

GSscore

= Global Stemming Score,

= Execution time to get a stem for a word,

Accw

= Correctness of stem word

𝛼

𝛽

= Variables to give the weights of time taken by

(17)

CSWF

CSW

100

(18)

NWC =S−CW

(19)

AWC F

CSW−NWC

CSW

100

(20)

score =

𝛼.∑T

𝛽.∑Accw

Table 8 Results of Sirsat’s

evaluation method Stemmers ICF WSF CSWF AWCF

Paice/Husk 64.63 70.99 28.73 19.26

lovins 56.52 73.35 27.80 −24.8

Porter1 51.88 67.17 31.97 −8.52

Porter2 53.72 66.58 34.76 8.6

A.Jabbar et al.

1 3

the stemmer to extract the stem and accuracy. These values reﬂect which is more impor-

tant, accuracy or execution time, if accuracy is matter, then the value of

𝛼

is set higher than

𝛽

and

Accw

are two diﬀerent measurements.

is measured in the form of

n1,2,…

and

Accw

can be from 1 to 100. The relation between accuracy and execution time is always

inverse to each other. Ridiculus results may be obtained if weights are not properly

assigned.

4.2 Gold standard assessments

In this evaluation approach, the correctness of a system is checked manually by experts: an input is given, and its corresponding output is checked by hand. Many statistical and rule-based stemmers have been evaluated manually, such as the Urdu stemmer of Ali et al. (2019), the Arabic stemmer of Al-Kabi et al. (2015) and a Persian stemmer (Taghi-Zadeh et al. 2015). The accuracy of a stemmer under gold standard assessment can be calculated as given in Eq. (21).

$$\mathrm{Accuracy} = \frac{\text{Total no. of correct stems obtained}}{\text{Total no. of words given to stemmer}} \times 100 \tag{21}$$

This method is good for small datasets; however, it is not suitable for large-scale evaluation. It reflects the ratio of correct stems produced by a stemmer, but is silent about words given to the stemmer that are already stems. Both true positives (TP) and true negatives (TN) are important in determining the performance of a stemmer.

4.3 Indirect evaluation

Stemming reduces the dimensionality of text data and improves the performance of information retrieval (IR) systems (de Oliveira and Junior 2018). The performance of a stemmer can also be evaluated in the context of a specific NLP application. For example, Alotaibi and Gupta (2018) evaluated their proposed stemmer in an IR system, Ali et al. (2018) tested their stemmer on text classification, and Boukhalfa et al. (2018) proposed an Arabic stemmer to improve the performance of an Arabic plagiarism detection system. Many researchers have compared the performance of various stemming algorithms in IR systems. Flores and Moreira (2016) evaluated Portuguese, Spanish, French, and English stemmers in IR experiments and reported an improvement in average precision (AP) for 70% of the query topics. Karaa (2013) modified the Porter (1980) stemmer and claimed that the new Porter stemmer raises the precision of the IR system to 0.852 and its recall to 0.884. In contrast, precision is 0.661 and recall is 0.671 without stemming, and 0.732 and 0.775 with the original Porter (1980) stemmer, as shown in Table 9.

Table 9 Stemmer’s performance in the IR system (Karaa 2013)

| Used stemmer in IR | Precision | Recall |
|---|---|---|
| Without stemming | 0.661 | 0.671 |
| Original Porter stemmer | 0.732 | 0.775 |
| New Porter stemmer | 0.852 | 0.884 |

Recall, precision, and the F1-measure (Jabbar et al. 2018a, b) are standard measures for assessing the performance of a stemmer in an IR system.

Recall Recall is the ratio between the correct stems produced by the stemmer and the total possible correct stems, as given in Eq. (22).

Precision Precision is the ratio of the correct stems produced to the total stems produced. It is calculated by Eq. (23).

Weighted F1-measure A variant of the F1-measure allows weighting the emphasis on precision over recall. It is calculated by Eq. (24), where $\beta$ is the weighting between precision and recall; typically $\beta = 1$.

A weighted combination of recall and precision In addition to the standard precision/recall measures, several other methods have been adopted by researchers. Lennon et al. (1981) used a weighted combination of recall and precision, as given in Eq. (25), where $E$ is an effectiveness function (a lower value of $E$ indicates better performance), $P$ is precision, $R$ is recall, and $b$ measures the relative importance attached to $P$ and $R$.

AP and Mean Average Precision (MAP) are also used to evaluate the impact of stemming in an IR system. AP is the mean of the precision values obtained at the rank of each relevant retrieved document for a query, whereas MAP is the average of the AP values when more than one query is used (Flores and Moreira 2016).
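The IR measures can be sketched in a few lines of Python. The sketch assumes the unweighted case $\beta = b = 1$ for F1, the effectiveness function $E = 1 - (1+b^2)PR/(b^2P+R)$, and the standard IR definition of AP (the mean of the precision values at the ranks of the relevant documents), normalised by the number of relevant documents in the ranked list. All function names are hypothetical; the precision/recall pairs are the ones reported in Table 9.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the beta = 1 case)."""
    return 2 * precision * recall / (precision + recall)


def e_measure(precision: float, recall: float, b: float = 1.0) -> float:
    """E = 1 - (1 + b^2) P R / (b^2 P + R); lower is better."""
    return 1.0 - (1 + b * b) * precision * recall / (b * b * precision + recall)


def average_precision(relevance: list[int]) -> float:
    """AP of one ranked list; relevance[i] is 1 if the document at rank i+1 is relevant."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(runs: list[list[int]]) -> float:
    """MAP: the mean of AP over several queries."""
    return sum(average_precision(run) for run in runs) / len(runs)


table9 = {
    "Without stemming": (0.661, 0.671),
    "Original Porter stemmer": (0.732, 0.775),
    "New Porter stemmer": (0.852, 0.884),
}
for name, (p, r) in table9.items():
    print(f"{name}: F1={f1_score(p, r):.3f}, E={e_measure(p, r):.3f}")

# Two hypothetical ranked result lists (1 = relevant, 0 = not relevant).
print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))
```

With $b = 1$ the effectiveness function is simply $1 - \mathrm{F1}$, so the two measures rank the three configurations of Table 9 identically.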

TERRIER (Qunis et al. 2006) is an open source IR system that is a highly ﬂexible,

eﬃcient and comprehensive platform for carrying out stemming experiments. TERRIER

(Qunis etal. 2006) is developed by the School of Computing Science, the University of

Glasgow in JAVA programming language and is available at https ://terri er.org/. It supports

UTF (Unicode Transformation Format) text hance corpora of many languages can be used.

It uses Porter stemmer (1980) by default. Many other IR systems are also available such as

Lemur/Indri (“Lemur” 2016) Lucene/Solr (“Lucene” 2018), Xapian (“Xapian” 2018). All

IR systems can be used to perform basic IR tasks. However, TERRIER has some deﬁcien-

cies that may include:

• It is diﬃcult to check how many words are relevant in the corpus.

• It is hard to choose a stemmer for search engine because every search engine has a dif-

ferent database.

• The results are not reliable because every stemmer is evaluated on diﬀerent datasets.

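$$\mathrm{Recall} = \frac{\text{Total correct stems produced by system}}{\text{Total possible correct stems}} \tag{22}$$

$$\mathrm{Precision} = \frac{\text{Total correct stems obtained by system}}{\text{Total stems produced by system}} \tag{23}$$

$$F1_{weighted} = \frac{(\beta^2 + 1)\,(\mathrm{precision} \times \mathrm{recall})}{\beta^2\,(\mathrm{recall}) + \mathrm{precision}} \tag{24}$$

$$E = 1 - \frac{(1 + b^2)\,PR}{b^2 P + R} \tag{25}$$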

A.Jabbar et al.

1 3

There is no mechanism deﬁned to ﬁnd number of produced stem words and actual stem

words because the platform only tells whether a word is stemmed or not. Moreover, it does

not tell the degree of the correctness of the stem.

5 Analysis ofevaluation methods

Every evaluation method measures specific features of the stemmer and ignores the rest. For example, Al-Shammari and Lin (2008) assessed the performance of their proposed Arabic stemmer, the Educated Text Stemmer (ETS), using the Paice (1994) evaluation method and claimed that ETS performed better than the stemmer of Khoja and Garside (1999). However, they obtained a stemming weight (SW) of 0 for both stemmers, which suggests that the two are equal in strength and performance. In contrast, according to the gold standard evaluation method, the accuracy of the ETS stemmer is 100%, which is better than the 70% achieved by the Khoja and Garside (1999) stemmer, as shown in Table 10. The authors randomly chose two samples of Arabic documents: the first consists of 47 medical documents containing 9,435 Arabic words, and the second comprises 10 long Arabic sports articles from CNN.com with a total of 7,071 words (Al-Shammari and Lin 2008).

Table 10 Comparison of Paice method with the manual method

| Stemmer | UI (Paice) | OI (Paice) | SW (Paice) | Accuracy (%) (gold standard) |
|---|---|---|---|---|
| ETS stemmer (Al-Shammari and Lin 2008) | 0 | 0 | 0 | 100 |
| Khoja stemmer (Khoja and Garside 1999) | 0 | 0.0755 | 0 | 70 |

In some experiments, the Paice (1994) evaluation method contradicts the gold standard evaluation method, as shown in Table 10. This disagreement can also be seen for Brazilian Portuguese (Alvares et al. 2005), where the results produced by the Paice (1994) method are opposite to the accuracy measured by the gold standard/manual method. The STEMBERS stemmer achieves an accuracy of 62.20%, which is better than its counterparts STEMP [55.30%] and PORTER [43.80%], but its SW value [$1.60 \times 10^{-3}$], the highest of the three, indicates the lowest performance among the evaluated stemmers, as shown in Table 11. Table 12 shows a similar pattern for the sample II data, in which STEMBERS achieved an accuracy of 69.02%, higher than its counterparts STEMP and PORTER. However, its SW is $3.50 \times 10^{-4}$, which is close to that of STEMP [$3.30 \times 10^{-4}$] but higher than that of PORTER [$1.25 \times 10^{-4}$]. Sample I and sample II comprise 102 and 2,696 manually constructed semantic groups, respectively.

Table 11 Results of the test using sample I

| Stemmers | References | Accuracy (%) | SW |
|---|---|---|---|
| STEMBERS | Alvares et al. (2005) | 62.20 | $1.60 \times 10^{-3}$ |
| STEMP | Orengo and Huyck (2001) | 55.30 | $1.44 \times 10^{-3}$ |
| PORTER | Porter (1980) | 43.80 | $0.67 \times 10^{-3}$ |

Table 12 Results of the test using sample II

| Stemmers | References | Accuracy (%) | SW |
|---|---|---|---|
| STEMBERS | Alvares et al. (2005) | 69.02 | $3.50 \times 10^{-4}$ |
| STEMP | Orengo and Huyck (2001) | 67.60 | $3.30 \times 10^{-4}$ |
| PORTER | Porter (1980) | 57.86 | $1.25 \times 10^{-4}$ |

This tendency of the Paice evaluation method was also demonstrated by AlSerhan and Alqrainy (2008), who compared the results obtained through the manual method and the Paice evaluation method for the Arabic language using the output of two virtual stemmers. Their results show that the Paice evaluation method contradicts the gold standard results. As stated before, an SW value of 0 indicates the strongest stemmer, and a value of 0 is obtained when OI, UI or both are zero, as shown in Table 13. When the gold standard accuracy is 100% or 0%, as shown in Table 13 (third column), the value of SW is zero in both cases, which is far from reality. But when the accuracy obtained by the gold standard method is 85.71%, the corresponding SW value is 0.56, whereas its competitor has a lower accuracy of 28.57% but an SW value of 0, which means that Stem2 is a stronger stemmer than Stem1, as reflected in Table 13.

Table 13 Experiment’s result of Paice evaluation method

| Author | Stemmer | Accuracy (%) | OI | UI | SW |
|---|---|---|---|---|---|
| AlSerhan et al. (2008) | Stem1 | 100 | 0 | 0 | 0 |
| | Stem2 | 100 | 0 | 0 | 0 |
| | Stem1 | 0 | 0 | 0 | 0 |
| | Stem2 | 0 | 0 | 0 | 0 |
| | Stem1 | 28.57 | 1 | 0 | 0 |
| | Stem2 | 85.71 | 0.2 | 0.36 | 0.56 |

Table 14 Comparison of manual and Sirsat’s method

| Stemmer | WSF (Sirsat) | CSWF (Sirsat) | AWCF (Sirsat) | Accuracy (manual) |
|---|---|---|---|---|
| HStemV1 | 69.47 | 81.12 | 52.83 | 56.35 |
| HStemV2 | 65.7 | 87.37 | 59.45 | 57.39 |

Table 15 Comparison of Gold standard, Frakes and Sirsat’s evaluation parameters

| Stemmer | ICF (Frakes) | MWC (Frakes) | WCF (Frakes) | MCR (Frakes) | WSF (Sirsat) | CSWF (Sirsat) | Accuracy (%) (gold standard) |
|---|---|---|---|---|---|---|---|
| Light10 | 51.08 | 2.04 | 80.86 | 1.57 | 80.84 | 32.2 | 14.96 |
| Motaz | 36.04 | 1.56 | 54.31 | 0.64 | 54.31 | 57.75 | 18.59 |
| Tashaphyne | 69.94 | 3.32 | 87.23 | 2.02 | 87.23 | 22.63 | 10.95 |
| SAFAR-stemmer | 48.22 | 1.93 | 62.55 | 1.11 | 62.55 | 80.61 | 33.70 |

Some of Sirsat’s parameters (Sirsat et al. 2013) follow the same tendency as the gold standard results, as shown in Table 14, where HStemV2 shows better performance in terms of accuracy. On the other hand, with respect to Sirsat’s parameter WSF, HStemV1 performs better.

Frakes’s evaluation method also reports how many words are changed by a stemmer, ignoring correctness. Jabbar et al. (2016) used the Quranic Arabic Corpus (Dukes and Habash 2010), which contains 18,350 unique Arabic words, to compare the results with the counterpart stemmers Light10 (Larkey et al. 2007), Motaz (Saad and Ashour 2010), and Tashaphyne (Zerrouki 2016). The authors of SAFAR-Stemmer (Jaafar et al. 2016) provided result statistics for their stemmer. From these statistics, we compute the Frakes evaluation parameters (Frakes and Fox 2003). Sirsat’s parameters (Sirsat et al. 2013) and manual accuracy are also calculated, and all are reported in Table 15. SAFAR-Stemmer (Jaafar et al. 2017) achieved an accuracy of 33.70, which is higher than that of its competitors Light10 (Larkey et al. 2007) [14.96], Motaz (Saad and Ashour 2010) [18.59] and Tashaphyne (Zerrouki 2016) [10.95]. Its WSF is 62.55, and its CSWF of 80.61 is the highest among the stemmers. Sirsat’s evaluation and the gold standard method thus show that SAFAR-Stemmer is better than the Light10 (Larkey et al. 2007), Motaz (Saad and Ashour 2010), and Tashaphyne (Zerrouki 2016) stemmers. However, the Frakes evaluation metrics (Frakes and Fox 2003) contradict this result, as Tashaphyne (Zerrouki 2016) performs better with respect to the higher values of ICF [69.94], MWC [3.32], WCF [87.23], and MCR [2.02], as shown in Table 15.

Considering the observations in Table 15, the ranking of the mentioned stemmers varies with the evaluation method. So, to be more precise, we developed a virtual scenario for stemming the resource-scarce Urdu language (see Table 16) and experimented with two virtual stemmers, VS1 and VS2.

Table 16 Experimental data for Paice evaluation

| Groups | Input words | Actual stem | VS1 | VS2 |
|---|---|---|---|---|
| G1 | [Bdnsaz/Bodybuilder] | [Bdan/body] | [Bdan/body] | [Bd/bad] |
| | [Bdnsazi/Bodybuilding] | [Bdan/body] | [Bdan/body] | [Bd/bad] |
| | [Bdni/bodily] | [Bdan/body] | [Bdan/body] | [Bd/bad] |
| | [Abdan/bodies] | [Bdan/body] | [Bdan/body] | [Bd/bad] |
| | [Abdano/bodies] | [Bdan/body] | [Bd/bad] | [Bd/bad] |
| G2 | [Bdpan/badness] | [Bd/bad] | [Bdan/body] | [Bd/bad] |
| | [Bdniah/bad luck] | [Bd/bad] | [Bd/bad] | [Bd/bad] |
| G3 | [Bad roh/Evil spirit] | [Roh/soul] | [Roh/soul] | [Bd/bad] |
| | [Bad rohain/Evil spirits] | [Roh/soul] | [Roh/soul] | [Bd/bad] |

Virtual stemmer 1 Paice evaluation The Unachieved Merge Total (UMT) is derived using Eq. (5).

Empirical evaluation andstudy oftext stemming algorithms

1 3

Then for a particular group

= Number of distinct stems produced by stemmer (a stemmer may produce multiple

stems for a particular group of words).

ng1

= possible number of morphological forms in particular group having same stem.

The total number of the stems  [Bdan/body] and  [Bd/bad] in group 1.

= ith stem produced by stemmer. For

u1,4

stems of word  [Bdan/body] in group 1

and for

u2,1

stem  [Bd/bad] in group 1 is produced.

We calculate the UMT [using Eq. (5)] for both stemmers and for each group that is

shown in Table16. The UMT is the only one that valeted the conﬂation regardless of

correctness (as mentioned in Table16). The value of UMT is 0 if all words in the group

are conﬂated (correctly or incorrectly). This causes the inverse result of the gold strand

evaluation results. UsingEq. (3), we can calculate:

The DMT values of both stemmers against each group are mentioned in Table16.

UMT

g=1



i=1

uing−ui(5

)

UMTg1=

2[4×(5−4)+1×(5−1)

]

UMTG1

DMTg=

2ng(ng−1)(3

)

DMT

g1=1

2[5×(5−1)]

DMTg1

=10

WMT

∑

i=1

vi(ns−vi

)
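The Paice totals for the two virtual stemmers can be recomputed with a short Python sketch. It assumes Paice's standard definitions $\mathrm{UI} = \mathrm{GUMT}/\mathrm{GDMT}$, $\mathrm{OI} = \mathrm{GWMT}/\mathrm{GDNT}$ and $\mathrm{SW} = \mathrm{OI}/\mathrm{UI}$, uses lower-cased transliterations of the Table 16 words instead of Urdu script, and returns `None` for SW when UI is zero (the tables in this paper report 0 in that case). The names `GROUPS`, `VS1`, `VS2` and `paice_indices` are hypothetical.

```python
from collections import Counter, defaultdict

# Table 16: manually constructed groups (lower-cased transliterations).
GROUPS = {
    "G1": ["bdnsaz", "bdnsazi", "bdni", "abdan", "abdano"],
    "G2": ["bdpan", "bdniah"],
    "G3": ["bad roh", "bad rohain"],
}
VS1 = {"bdnsaz": "bdan", "bdnsazi": "bdan", "bdni": "bdan", "abdan": "bdan",
       "abdano": "bd", "bdpan": "bdan", "bdniah": "bd",
       "bad roh": "roh", "bad rohain": "roh"}
VS2 = {word: "bd" for members in GROUPS.values() for word in members}


def paice_indices(groups, output):
    w = sum(len(members) for members in groups.values())  # words in the test data
    gumt = gdmt = gdnt = 0.0
    per_stem = defaultdict(Counter)  # stem -> {group: number of words}
    for group, members in groups.items():
        n_g = len(members)
        counts = Counter(output[word] for word in members)
        gumt += 0.5 * sum(u * (n_g - u) for u in counts.values())  # UMT of the group
        gdmt += 0.5 * n_g * (n_g - 1)                              # DMT of the group
        gdnt += 0.5 * n_g * (w - n_g)                              # DNT of the group
        for stem, u in counts.items():
            per_stem[stem][group] += u
    gwmt = 0.0
    for by_group in per_stem.values():
        n_s = sum(by_group.values())  # words that received this stem
        gwmt += 0.5 * sum(v * (n_s - v) for v in by_group.values())  # WMT of the stem
    ui = gumt / gdmt
    oi = gwmt / gdnt
    sw = oi / ui if ui else None  # SW is undefined when UI = 0
    return {"GUMT": gumt, "GDMT": gdmt, "GWMT": gwmt, "GDNT": gdnt,
            "UI": ui, "OI": oi, "SW": sw}


for name, output in (("VS1", VS1), ("VS2", VS2)):
    print(name, paice_indices(GROUPS, output))
```

For VS1 this gives GUMT = 5, GDMT = 12, GWMT = 5 and GDNT = 24, in line with Table 18; for VS2, which maps every word to one stem, UI is 0 and OI is 1, so the indices reward the stemmer that the manual check rates lowest.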

For the stem [Bdan/body] produced by VS1, $n_s = 5$ is the total number of words that receive this stem in group 1 and group 2, $v_1 = 4$ is the number of such words in group 1, and $v_2 = 1$ is the number of such words in group 2, so that

$$\mathrm{WMT}_{G1} = \frac{1}{2}\left[4 \times (5-4) + 1 \times (5-1)\right] = 4$$

The values of WMT (Eq. 7) for each stemmer against each group are presented in Table 17.

DNT is computed with Eq. (9), $\mathrm{DNT}_g = \frac{1}{2}\, n_g\,(w - n_g)$, where $w = 9$ is the number of words in the test data. For the first group,

$$\mathrm{DNT}_{g1} = \frac{1}{2}\left[5 \times (9-5)\right] = 10$$

The remaining values of $\mathrm{DNT}_g$ are calculated in the same way and are provided in Table 17.

The values of GUMT (Eq. 6), GDMT (Eq. 4), GWMT (Eq. 8), and GDNT (Eq. 10) are calculated from Table 17 and are given in Table 18.

Table 17 Calculation of UMT, DMT, WMT, DNT

| Stemmer | G | UMT | DMT | WMT | DNT |
|---|---|---|---|---|---|
| VS1 | G1 | 4 | 10 | 4 | 10 |
| | G2 | 1 | 1 | 1 | 7 |
| | G3 | 0 | 1 | 0 | 7 |
| VS2 | G1 | 0 | 10 | 0 | 10 |
| | G2 | 0 | 1 | 24 | 7 |
| | G3 | 0 | 1 | 0 | 7 |

Table 18 Calculation of GUMT, GDMT, GWMT, GDNT

| Stemmer | GUMT | GDMT | GWMT | GDNT |
|---|---|---|---|---|
| VS1 | 5 | 12 | 5 | 24 |
| VS2 | 0 | 12 | 24 | 24 |

Table 19 Experimental results to check the viability of various stemming evaluation methods

| Stemmer | UI (Paice) | OI (Paice) | SW (Paice) | MWC (Frakes) | ICF (Frakes) | WCF (Frakes) | MCR (Frakes) | WSF (Sirsat) | CSWF (Sirsat) | AWCF (Sirsat) | Accuracy (%) (gold standard) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VS1 | 0.42 | 0.21 | 0.5 | 3 | 66.67 | 1 | 2.3 | 100 | 77.8 | 57 | 77.8 |
| VS2 | 0 | 1 | 0 | 9 | 88.89 | 1 | 3.3 | 100 | 22.2 | 50 | 22.2 |

From Table 19, it is observed that the accuracy obtained by VS1 [77.8] is considerably higher than that of VS2 [22.2].

Table 19 also shows that, according to the stemming weight SW, VS1 [0.5] is a light stemmer and VS2 [0] is a heavy stemmer. These results suggest that VS2 is a perfect stemmer, which is quite the opposite of the manual evaluation results.

The MWC value is 3 for VS1 and 9 for VS2, which is higher. Table 19 likewise shows that the ICF obtained by VS2 [88.89] is higher than that of VS1 [66.67]. The WCF is 1 for both stemmers, but MCR is higher for VS2 [3.3] than for VS1 [2.3]. The parameters proposed by Frakes to assess the performance of a stemmer thus give results that conflict with those of the gold standard method.

The WSF obtained by both stemmers is 100%, which is above the threshold value [50%]. This indicates that both stemmers are strong and aggressive in nature. The AWCF value is 57 for VS1 and 50 for VS2, which suggests that VS1 is stronger and more aggressive than VS2. There is, however, a comparatively large difference between the two stemmers with respect to CSWF: 77.8 for VS1 and 22.2 for VS2, which shows the same tendency as the gold standard evaluation method. The other two parameters give results that contrast with the manual assessment.

6 Challenges and future directions

Text stemming is a well-studied yet still open area of work, especially for Arabic script-based languages. It has been recognized as an excellent pre-processing tool in many NLP applications. Researchers have proposed several approaches, each facing various issues and challenges. However, how to evaluate a stemmer remains an open question. In this section, we present the issues and challenges in evaluating a stemmer that must be considered by researchers in this domain.

Language type Languages fall into two categories with respect to the morphological process: concatenative and non-concatenative. In a concatenative language, the affixes and stem are arranged linearly, whereas in a non-concatenative language the stem and affixes are interleaved (Kastner 2019). As stemming approaches are language dependent, stemmer evaluation methods must change accordingly. For instance, the Paice (1994, 1996) evaluation method is not suitable for Semitic languages (AlSerhan and Alqrainy 2008).

Types of affixes Languages use a variety of affixes; Indonesian, for example, possesses prefixes, suffixes, confixes, and infixes (Setiawan et al. 2016). State-of-the-art stemming evaluation methods do not tell us which types of affixes are handled and which are not considered.

Text stemming approaches Most of the stemmers presented in the literature are linguistic knowledge-based and handle more than one affix type. On the other hand, statistical stemmers only remove suffixes. With the development of advanced AI methods such as neural networks, newly developed stemmers may produce superior results.

Domain-specific Some researchers evaluate a stemmer in a particular NLP application. Every application considers only specific features of the language. For instance, for English IR systems, Porter (1980) claimed that suffix removal alone is enough; however, this is not suitable for a Semitic language like Arabic.

Besides the above, issues such as computational complexity, space complexity, linguistic correctness and the types of affixes considered help to determine the performance of a stemmer. Hence, the metrics used to evaluate a stemmer must address all the issues mentioned above.

7 Conclusions

The purpose of stemming algorithms is simple: to extract the same stem from the various conflated forms of a word. Various researchers have proposed different parameters to measure the performance of text stemming algorithms. However, each criterion measures only specific aspects of stemmer performance. Different text stemming evaluation (TSE) methods have proved useful for specific NLP applications.

For the NLP systems developed so far that use stemming, there is no standard TSE method that can provide a benchmark for measuring the performance of stemming algorithms. The measured performance of a stemmer may increase or decrease when a different evaluation method is used. The reason for this lies in the type of experimental data, the training data, the size of the data and the construction of the stemming rules (if a rule-based approach is used). Therefore, there is a clear need to develop a standard evaluation method.

A variety of features, such as affix types, language types, and dataset types and sizes, can be used to develop a robust stemmer evaluation mechanism that considers the conflation ratio as well as the linguistic correctness of the stem. We conclude that this article provides a comprehensive review of the state-of-the-art text stemming evaluation methods, their challenges and the avenues for future work.

Funding Funding was provided by Bahauddin Zakariya University (PK) (Grant No: 2019-05).

Empirical evaluation andstudy oftext stemming algorithms

1 3

References

Ababneh M, Al-Shalabi R, Kanaan G, Al-Nobani A (2012) Building an eﬀective rule-based light stemmer

for arabic language to improve search eﬀectiveness. Int Arab J Inf Technol 9(4):368–372

Abainia K, Ouamour S, Sayoud H (2017) A novel robust Arabic light stemmer. J Exp Theor Artif Intell

29(3):557–573

Abu-Errub A, Odeh A, Shambour Q, Hassan OAH (2014) Arabic roots extraction using morphological

analysis. Int J Comput Sci Issues (IJCSI) 11(2):128

Ali M, Khalid S, Aslam MH (2018) Pattern-based comprehensive Urdu stemmer and short text classiﬁca-

tion. IEEE Access 6:7374–7389

Ali M, Khalid S, Saleemi M (2019) Comprehensive stemmer for morphologically rich urdu language. Int

Arab J Inf Technol 16(1):138–147

Alotaibi FS, Gupta V (2018) A cognitive inspired unsupervised language-independent text stemmer for

Information retrieval. Cognit Syst Res 52:291–300

Al-Kabi MN, Kazakzeh SA, Ata BMA, Al-Rababah SA, Alsmadi IM (2015) A novel root based Arabic

stemmer. J King Saud Univ-Comput Inf Sci 27(2):94–103

Al-Omari A, Abuata B (2014) Arabic light stemmer (ARS). J Eng Sci Technol 9(6):702–717

AlSerhan HM, Alqrainy S, Ayesh A (2008, November). Is paice method suitable for evaluating Arabic stem-

ming algorithms? In: International conference on computer engineering & systems, 2008 (ICCES

2008). IEEE, pp 131–135

Al-Shammari ET, Lin J. (2008, October). Towards an error-free Arabic stemming. In Proceedings of the

2nd ACM workshop on Improving non English web searching. ACM, pp 9–16

Al-Sughaiyer IA, Al-Kharashi IA (2004) Arabic morphological analysis techniques: A comprehensive sur-

vey. J American Soc Inf Sci Tech 55(3):189–213

Alvares RV, Garcia AC, Ferraz I (2005, December) STEMBR: a stemming algorithm for the Brazilian Portuguese language. In: Portuguese conference on artificial intelligence. Springer, Berlin, pp 693–701

Aronoﬀ M, Fudeman K (2011) What is morphology? vol. 8.Wiley, pp 2–3

Bimba A, Idris N, Khamis N, Noor NF (2016) Stemming Hausa text: using aﬃx-stripping rules and ref-

erence look-up. Lang Resour Eval 50(3):687–703

Bölücü N, Can B (2019) Unsupervised joint PoS tagging and stemming for agglutinative languages. ACM Trans Asian Low-Resour Lang Inf Process 18(3):25. https://doi.org/10.1145/3292398

Boudchiche M, Mazroui A (2015, December). Evaluation of the ambiguity caused by the absence of dia-

critical marks in Arabic texts: statistical study. In: 2015 5th international conference on informa-

tion and communication technology and accessibility (ICTA). IEEE, pp 1–6

Boukhalfa I, Mostefai S, Chekkai N (2018, March) A study of graph based stemmer in Arabic extrinsic

plagiarism detection. In: Proceedings of the 2nd mediterranean conference on pattern recognition

and artiﬁcial intelligence. ACM, pp 27–32

Brychcín T, Konopík M (2015) HPS: high precision stemmer. Inf Process Manag 51(1):68–91

Buckley C (1985) Implementation of the smart information retrieval system. Technical report 85–686,

Cornell University.

Cambria E, White B (2014) Jumping NLP curves: a review of natural language processing research.

IEEE Comput Intell Mag 9(2):48–57

Chintala DR, Reddy EM (2013) An approach to enhance the CPI using Porter stemming algorithm. Int J

Adv Res Comput Sci Softw Eng 3(7):1148–1156

CISI Collection https ://ir.dcs.gla.ac.uk/resou rces/test_colle ction s/cisi/. Accessed 30 Dec 2019. Devel-

oped by University of Glasgow

Dahab MY, Ibrahim A, Al-Mutawa R (2015) A comparative study on Arabic stemmers. Int J Comput

Appl 125(8):38–47

Dang Q, Zhang J, Lu Y, Zhang K (2013) WordNet-based suﬃx tree clustering algorithm. In: Interna-

tional conference on information science and computer applications (ISCA 2013)

Dey A, Paul A, Purkayastha BS (2014) Named entity recognition for Nepali language: a semi hybrid

approach. Int J Eng Innov Technol (IJEIT) 3:21–25

Dianati MH, Sadreddini MH, Hossein RA, Fakhrahmad SM, Taghi-Zadeh H (2014) Words stemming based

on structural and semantic similarity. Comp Eng Appl J 3(2):89–99

de Oliveira RAN, Junior MC (2018) Experimental analysis of stemming on jurisprudential documents

retrieval. Information 9(2):28

Dukes K, Habash N (2010) Morphological annotation of Quranic Arabic. In Lrec, pp 2530–2536

El-Defrawy M, El-Sonbaty Y, Belal NA (2016) A rule-based subject-correlated Arabic stemmer. Arab J

Sci Eng 41(8):2883–2891

A.Jabbar et al.

1 3

Fattah MA, Ren F, Kuroiwa S (2006) Stemming to improve translation lexicon creation form bitexts. Inf Process Manag 42(4):1003–1016

Flores FN, Moreira VP (2016) Assessing the impact of stemming accuracy on information retrieval – a multilingual perspective. Inf Process Manag 52(5):840–854

Frakes WB, Fox CJ (2003) Strength and similarity of affix removal stemming algorithms. In: ACM SIGIR forum, vol 37, no 1. ACM, pp 26–30

Gaidhane MS, Gondhale MD, Talole MP (2015) A comparative study of stemming algorithms for natural language processing. J Eng Educ Technol (ARDIJEET) 3(2):1–6

Giachanou A, Crestani F (2016) Like it or not: a survey of twitter sentiment analysis methods. ACM Comput Surv (CSUR) 49(2):28

Harman D (1991) How effective is suffixing? J Am Soc Inf Sci 42(1):7–15

Hassani K, Lee WS (2016) Visualizing natural language descriptions: a survey. ACM Comput Surv (CSUR) 49(1):17

Husain MS, Ahamad F, Khalid S (2013) A language independent approach to develop Urdu stemmer. Advances in computing and information technology. Springer, Berlin, pp 45–53

Hull DA (1996) Stemming algorithms: a case study for detailed evaluation. J Am Soc Inf Sci 47:70–84

Hussain Z, Iqbal S, Saba T, Almazyad AS, Rehman A (2017) Design and development of dictionary-based stemmer for the Urdu language. J Theor Appl Inf Technol 95(15):3560–3569

Islam Md, Uddin Md, Khan M (2007) A light weight stemmer for Bengali and its use in spelling checker. Retrieved 24 March 2019 from http://hdl.handle.net/10361/328

Ismailov A, Jalil MA, Abdullah Z, Rahim NA (2016) A comparative study of stemming algorithms for use with the Uzbek language. In: 3rd international conference on computer and information sciences (ICCOINS), 2016. IEEE, pp 7–12

Jaafar Y, Namly D, Bouzoubaa K, Yousfi A (2017) Enhancing Arabic stemming process using resources and benchmarking tools. J King Saud Univ-Comput Inf Sci 29(2):164–170

Jabbar A, Iqbal S, Khan MUG (2016a) Analysis and development of resources for Urdu text stemming. In: Proceedings of the 6th annual international conference on language and technology, KICS-CLE, UET Lahore

Jabbar A, Iqbal S, Akhunzada A, Abbas Q (2018a) An improved Urdu stemming algorithm for text mining based on multi-step hybrid approach. J Exp Theor Artif Intell. https://doi.org/10.1080/0952813X.2018.1467495

Jabbar A, Iqbal S, Khan MUG, Hussain S (2018b) A survey on Urdu and Urdu like language stemmers and stemming techniques. Artif Intell Rev 49(3):339–373

Jivani AG (2011) A comparative study of stemming algorithms. Int J Comp Tech Appl 2(6):1930–1938

Karaa WBA (2013) A new stemmer to improve information retrieval. Int J Netw Secur Appl 5(4):143

Karimi S, Wang C, Metke-Jimenez A, Gaire R, Paris C (2015) Text and data mining techniques in adverse drug reaction detection. ACM Comput Surv (CSUR) 47(4):56

Kastner I (2019) Templatic morphology as an emergent property. Nat Lang Linguist Theory 37(2):571–619

Khalid A, Hussain Z, Baig MA (2016) Arabic stemmer for search engines information retrieval. Int J Adv Comput Sci Appl 1(7):407–411

Khan S, Waqas A, Usama B, Xuan W (2015) Template based affix stemmer for a morphologically rich language. Int Arab J Inf Technol 12(2):146–154

Khoja S, Garside R (1999) Stemming Arabic text. Computing Department, Lancaster University, Lancaster, UK

Krovetz R (2000) Viewing morphology as an inference process. Artif Intell 118(1–2):277–294

Larkey LS, Ballesteros L, Connell ME (2007) Light stemming for Arabic information retrieval. Arabic computational morphology. Springer, Dordrecht, pp 221–243

Lemur (2016) https://www.lemurproject.org. Accessed 14 Aug 2018

Lennon M, Peirce DS, Tarry BD, Willett P (1981) An evaluation of some conflation algorithms for information retrieval. J Inf Sci 3(4):177–183

Lovins JB (1968) Development of a stemming algorithm. Mech Transl Comput Linguist 11(1–2):22–31

Lucene (2018) https://lucene.apache.org. Accessed 12 Aug 2018

Mateen A, Malik MK, Nawaz Z, Danish HM, Siddiqui MH, Abbas Q (2017) A hybrid stemmer of Punjabi Shahmukhi script. Int J Comput Sci Netw Secur 17(8):90–97

McCormick C (2016) Word2Vec tutorial - the skip-gram model. https://www.mccormickml.com

Mishra U, Prakash C (2012) MAULIK: an effective stemmer for Hindi language. Int J Comput Sci Eng 4(5):711–717

Empirical evaluation andstudy oftext stemming algorithms

1 3

Mochizuki M, Aizawa K (2000) An affix acquisition order for EFL learners: an exploratory study. System 28(2):291–304

Moghadam FM, MohammadReza K (2015) Comparative study of various Persian stemmers in the field of information retrieval. J Inf Proc Syst 11(3):450–464

Momenipour F, Keyvanpour MR (2016) PHMM: stemming on Persian texts using statistical stemmer based on hidden Markov model. Int J Inf Sci Manag 14(2):107–117

Mustafa AM, Rashid TA (2018) Kurdish stemmer pre-processing steps for improving information retrieval. J Inf Sci 44(1):15–27

Nguyen DT, Leveling J (2013) Exploring domain-sensitive features for extractive summarization in the medical domain. International conference on application of natural language to information systems. Springer, Berlin, pp 90–101

Nwesri AFA, Alyagoubi HAH (2015) Applying Arabic stemming using query expansion. In: 2015 26th international workshop on database and expert systems applications (DEXA). IEEE, pp 299–303

Orengo VM, Huyck C (2001) A stemming algorithm for the Portuguese language. In: SPIRE ’01: Proceedings of the eighth symposium on string processing and information retrieval, pp 186–193

Paice CD (1990) Another stemmer. SIGIR Forum 24(3):56–61

Paice CD (1996) Method for evaluation of stemming algorithms based on error counting. J Am Soc Inf Sci 47(8):632–649

Paice CD (1994) An evaluation method for stemming algorithms. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. Springer, New York, pp 42–50

Pande BP, Tamta P, Dhami HS (2018) Generation, implementation and appraisal of an N-gram based stemming algorithm. Digit Scholarsh Humanit. https://doi.org/10.1093/llc/fqy053

Paik JH, Pal D, Parui SK (2011) A novel corpus-based stemming algorithm using co-occurrence statistics. In: Proceedings of the 34th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR’11). ACM, New York, pp 863–872

Patil CG, Patil SS (2013) Use of Porter stemming algorithm and SVM for emotion extraction from news headlines. Int J Electron Commun Soft Comput Sci Eng 2(7):9–13

Porter MF (2006) https://snowball.tartarus.org/algorithms/english/stemmer.html

Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137

Qureshi AH, Hassan MU, Akhter S (2018) Towards description of derivation in Urdu: morphological perspective. Al-Qalam 23(2):96–100

Rani SPR, Ramesh B, Anusha M, Rani SJGR (2015) Evaluation of stemming techniques for text classification. Int J Comput Sci Mobile Comput 4(3):165–171

Rashid TA, Mohamad SO (2016) Enhancement of detecting wicked website through intelligent methods. International symposium on security in computing and communication. Springer, Singapore, pp 358–368

Rashidi A, Lighvan MZ (2014) HPS: a hierarchical Persian stemming method. arXiv preprint arXiv:1403.2837

Rehman Z, Anwar W, Bajwa UI, Xuan W, Chaoying Z (2013) Morpheme matching based text tokenization for a scarce resourced language. PLoS ONE 8(8):e68178

Saad MK, Ashour W (2010) Arabic morphological tools for text mining. Corpora 18:19

Saeed AM, Rashid TA, Mustafa AM, Al-Rashid Agha RA, Shamsaldin AS, Al-Salihi NK (2018a) An evaluation of Reber stemmer with longest match stemmer technique in Kurdish Sorani text classification. Iran J Comput Sci 1(2):99–107

Saeed AM, Rashid TA, Mustafa AM, Fattah P, Ismael B (2018b) Improving Kurdish web mining through tree data structure and Porter’s Stemmer algorithms. UKH J Sci Eng 2(1):48–54

Sarma B, Purkayastha BS (2013) An affix based word classification method of Assamese text. Int J Adv Res Comput Sci 4(9):213–216

Schofield A, Mimno D (2016) Comparing apples to apple: the effects of stemmers on topic models. Trans Assoc Comput Linguist 4:287–300

Setiawan R, Kurniawan A, Budiharto W, Kartowisastro IH, Prabowo H (2016) Flexible affix classification for stemming Indonesian language. In: 2016 13th international conference on electrical engineering/electronics, computer, telecommunications and information technology (ECTI-CON). IEEE, pp 1–6

Singh J, Gupta V (2016) Text stemming: approaches, applications, and challenges. ACM Comput Surv (CSUR) 49(3):45

Singh J, Gupta V (2017) An efficient corpus-based stemmer. Cognit Comput 9(5):671–688

Sirsat SR, Chavan V, Mahalle HS (2013) Strength and accuracy analysis of affix removal stemming algorithms. Int J Comput Sci Inf Technol 4(2):265–269

A.Jabbar et al.

1 3

Sulaiman S, Omar K, Omar N, Murah MZ, Abdul Rahman HD (2014) The effectiveness of a Jawi stemmer for retrieving relevant Malay documents in Jawi characters. ACM Trans Asian Lang Inf Process (TALIP) 13(2):6

Suryani AA, Widyantoro DW, Purwarianti A, Sudaryat Y (2018) The rule-based Sundanese stemmer. ACM Trans Asian Low-Resour Lang Inf Process (TALLIP) 17(4):27

Taghi-Zadeh H, Sadreddini MH, Diyanati MH, Rasekh AH (2015) A new hybrid stemming method for Persian language. Digit Scholarsh Humanit 32(1):209–221

Thangarasu M, Manavalan R (2013) Design and development of stemmer for Tamil language: cluster analysis. Int J Adv Res Comput Sci Softw Eng 3(7):812–818

The free dictionary (2018) https://www.thefreedictionary.com/. Accessed 03 Aug 2018

Ounis I, Amati G, Plachouras V, He B, Macdonald C, Lioma C (2006) A high performance and scalable information retrieval platform. In: SIGIR workshop on open source information retrieval

Urdu L (2006) https://182.180.102.251:8081/oud/help_3.htm. Accessed 04 Aug 2018

Xapian (2018) https://xapian.org. Accessed 07 Aug 2018

Xer (1994) Xerox linguistic database reference, English version 1.1.4 edn

Yadollahi A, Shahraki AG, Zaiane OR (2017) Current state of text sentiment analysis from opinion to emotion mining. ACM Comput Surv (CSUR) 50(2):25

Zerrouki T (2016) Tashaphyne 0.2 (Online). https://pypi.python.org/pypi/Tashaphyne. Accessed 14 Apr 2016

Zhou D, Mark T, Brailsford T, Wade V, Ashman H (2012) Translation techniques in cross-language information retrieval. ACM Comput Surv (CSUR) 45(1):1

Publisher’s Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Descriptive Answers Evaluation Using Natural Language Processing Approaches

Article

Full-text available

Jan 2024

Answer scripts are an important aspect in evaluating student’s performance. Evaluating papers from a descriptive outlook can be a challenging and exhausting task. Typically, answer script evaluations are conducted dynamically, which can lead to bias and can be quite time-consuming. Various efforts have been made to automate the evaluation of student responses with the usage of Artificial Intelligence techniques. Yet, most of the work relies on particular words or typical counts to accomplish this task. In addition, there is a shortage of organized data sets too. In this research a novel ensemble model Descriptive answer evaluation system(DAES) is introduced, which integrates Topic Modelling (TM) and Question Answering (QA) models for automatically evaluating descriptive answers. Latent Dirichlet Allocation (LDA) and a fine-tuned Text-to-Text Transfer Transformer(T5) models were utilized to identify key topics and the correctness of specific statements within the student answers. Sentence-BERT is utilized to encode sentences and cosine similarity method is applied to generate similarity scores. For this approach, LDA studies thematic evaluation, T5 assess for semantic analysis of the student answer. A final score is given to each answer after a thorough review procedure using predetermined criteria. Experiments results in achieving an accuracy of 95%, precision of 94%, recall 95% and f1-score of 94% on training data by using the proposed model.

Values That Are Explicitly Present in Fairy Tales: Comparing Samples from German, Italian and Portuguese Traditions

Article

Jun 2024

Looking at how social values are represented in fairy tales can give insights about the variations in communication of values across cultures. We study how values are communicated in fairy tales from Portugal, Italy and Germany using a technique called word embedding with a compass to quantify vocabulary differences and commonalities. We study how these three national traditions differ in their explicit references to values. To do this, we specify a list of value-charged tokens, consider their word stems and analyse the distance between these in a bespoke pre-trained Word2Vec model. We triangulate and critically discuss the validity of the resulting hypotheses emerging from this quantitative model. Our claim is that this is a reusable and reproducible method for the study of the values explicitly referenced in historical corpora. Finally, our preliminary findings hint at a shared cultural understanding and the expression of values such as Benevolence, Conformity, and Universalism across the studied cultures, suggesting the potential existence of a pan-European cultural memory.

Systematic assessment on the remediation of Bisphenol A in the global environments: a mixed method analysis of research outputs

Article

Full-text available

Mar 2024

Bisphenol A (BPA) is an endocrine-disrupting compound and a mutagenic agent that poses health hazards to living organisms, making it a global contaminant. Several remediation techniques have been reported in the literature, however, a mixed-method science mapping analysis of research trends on BPA is still lacking. The present study aimed to investigate global research trends in BPA remediation. Published research papers on BPA remediation indexed in Web of Science, PubMed, and Scopus between 1992 and 2021 were analysed qualitatively and quantitatively using science mapping algorithms including Rstudio, bibliometrix package and R Version 4.2.1. The thematic areas were determined using k-means clustering of the author-keywords while Porter’s stemming algorithm was used to stemmed inflectional terms to their roots. Overall, 640 documents were published by 1903 authors with 2.07 authors/article and 0.336 article/author, 4.31 co-authors/article, an annual growth rate of 17.35% and a collaboration index of 2.99. Research productivity increased from 1 article in 1992 to 93 articles in 2021. The citations of the topmost 23 articles ranged from 365 to 109 and the total citation per year ranged from 45.6 to 27.3. China (n = 267, 41.7%), Japan (n = 53, 8.3%), USA (n = 33, 5.2%) and Korea (n = 28, 4.4%) were respectively the top four countries based on the total of published articles and overall citation. There were 48 relevant keywords dominated by Bisphenol A, adsorption, biodegradation, and peroximonosulphate. The present analysis identifies research accomplishment, focus and gaps on Bisphenol A remediation and offer the researchers the information needed to forecast future research priorities that can help policymakers and governments to internationalize collaborations and create research curricula that can remediate BPA on a global scale.

Opinion Target Extraction And Category Detection Using Artificial Intelligence Techniques In Aspect-based Opinion Mining - Özellik Tabanlı Görüş Madenciliğinde Yapay Zeka Teknikleri Kullanarak Görüş Hedefi̇ Çıkarımı Ve Kategori Tespiti

Thesis

Full-text available

Apr 2023

Kürşat Mustafa Karaoğlan

Recently, online review platforms have become significant data sources that support users' purchasing decisions. Users refer to these information sources to reach possible experiences before purchasing a product or service. On the other hand, businesses aim to benefit from the potential power of these resources to investigate the effects of the products they market on users. However, considering the volumetric size of these resources, it becomes almost impossible for users to evaluate them effectively by reading all the reviews. On the other hand, businesses that face large user populations need automated approaches in processes such as data processing and analysis. For this reason, researchers are interested in Aspect-based Opinion Mining studies, which enable more effective and fine-grained analysis by addressing the problems mentioned above. In this thesis, firstly, the Opinion Target Extraction (OTE) approach, including the pattern-based text pre-processing method, algorithms of extended syntactic-based relation rules with auxiliary components, and the majority voting method applied to provide performance optimization in model outputs, is proposed. It was provided to extract explicit expressions (opinion targets) representing distinctive entity aspects interpreted by users with subjective opinion words with this proposed approach. In order to test the effectiveness of the proposed OTE approach, experimental studies were carried out on a data set containing restaurant reviews. When the results of experimental studies were analyzed, it was reached that the proposed approach produced results comparable to supervised approaches in the literature and performed higher than other unsupervised approaches. Secondly, deep learning-based Aspect Category Detection (ACD) approaches are proposed to classify multi-label and hierarchically tagged review sentences with specific aspect categories. 
In the proposed ACD approaches, Convolutional Neural Network (CNN) and Deep Neural Network (DNN) based in which pre-trained Bidirectional Encoder Representations from Transformers (BERT) and Semantic Folding Theory (SFT) word embedding models (WEMs) that generate rich vector representations by considering contextual information are applied in their inputs multi-label text classification approaches have been developed. In order to analyze the effectiveness of the developed ACD approaches and their contribution to the classification performance of the implemented WEMs, experimental studies were carried out on laptop and restaurant review datasets. When the results of experimental studies were analyzed, higher or competitive performance results were obtained with the proposed approaches compared to other approaches in the literature—in addition, evaluating the performances of applied WEMs together is among the firsts in the literature. TR: Son zamanlarda, çevrimiçi inceleme platformları kullanıcıların satın alma kararlarını destekleyen önemli bilgi kaynakları haline gelmiştir. Kullanıcılar bir ürünü ya da hizmeti satın almadan önce olası deneyimlere ulaşmada bu bilgi kaynaklarına başvurmaktadır. İşletmeler ise pazarladıkları ürünlerin kullanıcılar üzerindeki etkilerini keşfedebilmek için bu kaynakların potansiyel gücünden yararlanmayı hedeflemektedir. Ancak bu kaynakların hacimsel büyüklüğü düşünüldüğünde; kullanıcıların tüm incelemeleri okuyarak etkin bir şekilde değerlendirmesi neredeyse imkânsız hale gelmektedir. Diğer bir taraftan büyük kullanıcı popülasyonuyla karşı karşıya kalan işletmeler, verilerin işlenmesi ve analizi gibi süreçlerde otomatikleştirilmiş yaklaşımlara ihtiyaç duymaktadır. Bu sebeple araştırmacılar, yukarıda bahsedilen problemleri ele alarak daha etkin ve detaylı analizlere olanak sağlayan Özellik Tabanlı Görüş (Fikir) Madenciliği çalışmalarına ilgi göstermektedir. 
Bu tez çalışmasında ilk olarak, örüntü tabanlı metin ön işleme yöntemine, yardımcı bileşenlerle genişletilmiş sözdizilimsel tabanlı ilişki kuralları algoritmalarına ve model çıktılarında performans optimizasyonu sağlamak amacıyla uygulanan çoğunlukla seçim yöntemine sahip Görüş Hedefi Çıkarımı (Opinion Target Extraction, OTE) yaklaşımı önerilmiştir. Önerilen bu yaklaşımla, kullanıcılar tarafından öznel nitelikli görüş sözcükleriyle yorumlanan varlığa ilişkin ayırt edici özellikleri temsil eden açık ifadelerin (görüş hedefleri) çıkarılması sağlanmıştır. Önerilen OTE yaklaşımının etkinliğini test etmek amacıyla, restoran incelemeleri içeren veri seti üzerinde deneysel çalışmalar gerçekleştirilmiştir. Deneysel çalışmaların sonuçları analiz edildiğinde, önerilen yaklaşımın literatürdeki denetimli yaklaşımlarla karşılaştırılabilir sonuçlar ürettiği, denetimsiz diğer yaklaşımlara göre ise daha yüksek performans gösterdiği sonucuna ulaşılmıştır. İkinci olarak, belirli özellik kategorileriyle çoklu ve hiyerarşik yapıda etiketlenmiş inceleme cümlelerinin sınıflandırılması amacıyla derin öğrenme tabanlı Özellik Kategorisi Tespiti (Aspect Category Detection, ACD) yaklaşımları önerilmiştir. Önerilen ACD yaklaşımlarında, girdilerinde bağlamsal bilgiyi dikkate alarak zengin vektör temsilleri üretebilen önceden eğitilmiş Dönüştürücülerden Çift Yönlü Kodlayıcı Temsilleri (BERT) ve Anlamsal Katlama Teorisi (SFT) kelime temsil modellerinin (word embedding model, WEM) uygulandığı Evrişimsel Sinir Ağı ve Derin Sinir Ağı tabanlı çok etiketli metin sınıflandırma yaklaşımları geliştirilmiştir. Geliştirilen ACD yaklaşımlarının etkinliklerini ve uygulanan WEM’lerin sınıflandırma performanslarına olan katkılarını analiz etmek için, dizüstü bilgisayar ve restoran inceleme veri setleri üzerinde deneysel çalışmalar gerçekleştirilmiştir. 
Deneysel çalışmaların sonuçları analiz edildiğinde, önerilen yaklaşımlarla literatürdeki diğer yaklaşımlara göre daha yüksek veya rekabetçi performans sonuçları elde edilmiştir. Ayrıca bu çalışmada uygulanan WEM’lerin performanslarının birlikte değerlendirilmesi, literatürde ilkler arasında yer almaktadır.

Enhancing Machine Learning Performance in Cyberbullying Detection through Hyperparameter Optimization

Conference Paper

Dec 2023

Arabic stems are being built in a database utilising an automated process.

Conference Paper

Nov 2023

Intelligent performance evaluation of e-commerce express services using machine learning: A case study with quantitative analysis A R T I C L E I N F O

Article

Apr 2024
EXPERT SYST APPL

Amidst the robust development of the service economy and information technology, the information age has significantly transformed consumption concepts and service demands. The requirements for logistics services from customers have become increasingly stringent. Beyond price and speed, quality has steadily emerged as the most crucial factor. When evaluating express services, experts primarily choose indicators from an enterprise perspective, mainly overlooking customer perception. The purpose of this paper is to establish a key index system for assessing the quality of express services aimed at improving the service quality of express deliveries within the development mode of intelligent logistics. In terms of methodology, we proposed a quantitative and qualitative e-commerce logistics service evaluation system. Specifically, we developed a key index system for assessing the quality of express services based on the theory of the six senses and the selection from an index library at first. And then, we ascertained the weight of each index in the evaluation by analyzing the proportion of customer comments. We employed analysis methods of machine learning, text data processing, mathematical statistics, and big data technology to determine the evaluation index and evaluation model of express service. Finally, our primary findings from the empirical verification and analysis of a real express enterprise indicate that this evaluation model can effectively assess the quality of express services. Our model can evaluate the current service quality and predict the quality of future services. We have derived some interpretations and conclusions based on these analyses and findings. Firstly, the level of customer perception is of great importance in the evaluation of express service quality. Secondly, our model can provide valuable feedback for express enterprises to improve their services. 
Lastly, we proposed corresponding improvement strategies to enhance the service quality of express enterprises under the intelligent logistics development mode.

An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems

Article

Full-text available

Jan 2023

The exponential increase in textual unstructured digital data creates significant demand for advanced and smart stemming systems. As a preprocessing stage, stemming is applied in various research fields such as information retrieval (IR), domain vocabulary analysis, and feature reduction in many natural language processing (NLP). Text stemming (TS), an important step, can significantly improve performance in such systems. Text-stemming methods developed till now could be better in their results and can produce errors of different types leading to degraded performance of the applications in which these are used. This work presents a systematic study with an in-depth review of selected stemming works published from 1968 to 2023. The work presents a multidimensional review of studied stemming algorithms i.e., methodology, data source, performance, and evaluation methods. For this study, we have chosen different stemmers, which can be categorized as 1) linguistic knowledge-based, 2) statistical, 3) corpus-based, 4) context-sensitive, and 5) hybrid stemmers. The study shows that linguistic knowledge-based stemming techniques were widely used for highly inflected languages (such as Arabic, Hindi, and Urdu) and have reported higher accuracy than other techniques. We compare and analyze the performance of various state-of-the-art TS approaches, including their issues and challenges, which are summarized as research gaps. This work also analyzes different NLP applications utilizing stemming methods. At the end, we list the future work directions for interested researchers.

Modelling a machine learning based multivariate content grading system for YouTube Tamil-post analysis

Article

Oct 2023
J INTELL FUZZY SYST

Writing is a crucial component of the language requirement and is an effective method for correctly reflecting language proficiency. Manually evaluating Tamil language exams becomes time-consuming and costly for standardized language administrators as they grow in popularity. Numerous studies on computerized English assessment systems have been conducted in recent years. Due to Tamil text’s complicated grammatical structures, less research has been done on computerized evaluation methods. In this research, we present a Tamil review comment analysis system using a novel multivariate naïve Bayes classifier (mv - NB) where the comments are acquired from an online social network and performed training using the database for further analysis. Experiments show that the graded Kappa of 0.4239, error rate of 2.55 and precision of 85% was achieved on the online dataset by our contents grading system, which is superior in grading compared to the other widely used machine learning algorithms training on big datasets. Our findings are promising. Additionally, our contents analysis may provide beneficial criticism on Tamil writing on YouTube posts including comments, spelling errors and morphological issues that help to analyze thelanguage correlation.

Chapter

Sep 2023

Every organization, both educational and non-educational, administers conducting tests to check the. Along with objective questions, the question papers used to assess a student’s performance include descriptive questions. In contrast to competitive exams, which include objective or multiple-choice questions, school exams typically include descriptive or subjective questions. Machines can quickly and readily assess the objective responses, which is highly helpful for saving time and money. However, because there is no automatic mechanism or computer to analyze student responses, schools and institutions have difficulties when analyzing descriptive questions. Due to the time and effort required by manual review, it is the end result since the evaluator's feelings have an effect on the evaluation's quality, which in turn has an effect on the student's performance score, there is also a potential of bias. Finding comparable words and phrases is the first step in assessing similarity. The project includes a Natural Language Processing (NLP), concepts, Deep learning as well as machine learning to demonstrate the development of autonomous subjective response assessment algorithms. In order to evaluate results in real-time datasets, this work analyses a variety of machine learning classifiers and deep learning concepts.KeywordsSubjective questionsDeep learningAutomatic language recognitionMachine learningSimilarity checking

Generation, implementation, and appraisal of an N-gram-based stemming algorithm

Article

Full-text available

Oct 2018

A language-independent stemmer has always been looked for. Single N-gram tokenization technique works well; however, it often generates stems that start with intermediate characters, rather than initial ones. We present a novel technique that takes the concept of N-gram stemming one step ahead and compare our method with an established algorithm in the field, say, Porter’s stemmer for English, Spanish, and Portuguese languages. Results indicate that our N-gram stemmer is comparable with the Porter’s linguistic stemmer.

An improved Urdu stemming algorithm for text mining based on multi-step hybrid approach

Article

Full-text available

May 2018

Stemming is a basic operation in natural language processing (NLP) that removes derivational and inflectional affixes without performing a morphological analysis, in order to extract the root or stem. In NLP domains, a stemmer is used to improve information retrieval (IR), text classification (TC), text mining (TM), and related applications. Existing Urdu stemmers use only unigram words from the input text and ignore bigrams, trigrams, and longer n-grams. To improve the process and efficiency of stemming, bigram and trigram words must be included. Despite this, past studies offer only a few methods for Urdu stemming. Therefore, in this paper, we propose an improved Urdu stemmer that uses a hybrid approach divided into multiple steps and handles unigram, bigram, and trigram features. To evaluate the proposed Urdu stemming method, we used two corpora: a word corpus and a text corpus. Moreover, two different evaluation metrics were applied to measure the performance of the proposed algorithm. The proposed algorithm achieved an accuracy of 92.97% and a compression rate of 55%. These experimental results indicate that the proposed system can be used to increase the effectiveness and efficiency of Urdu stemming for information retrieval and text mining applications.
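
The abstract reports accuracy and compression rate without defining them. The following is a minimal sketch under two assumptions: accuracy is the share of words whose predicted stem equals a gold-standard stem, and compression rate is the index compression factor (N - S) / N, where N is the number of distinct words and S the number of distinct stems. The function names and the English word list are illustrative and do not come from the paper.

```python
def stemming_accuracy(predicted_stems, gold_stems):
    """Fraction of positions where the predicted stem equals the gold stem."""
    if not gold_stems or len(predicted_stems) != len(gold_stems):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == g for p, g in zip(predicted_stems, gold_stems))
    return matches / len(gold_stems)


def index_compression_factor(words, stems):
    """(N - S) / N for N distinct words and S distinct stems (assumed definition)."""
    n_words = len(set(words))
    if n_words == 0:
        raise ValueError("words must be non-empty")
    n_stems = len(set(stems))
    return (n_words - n_stems) / n_words


words = ["connect", "connected", "connecting", "connection", "walk", "walked"]
gold = ["connect", "connect", "connect", "connect", "walk", "walk"]
# Hypothetical stemmer output with one under-stemmed word ("connecti").
predicted = ["connect", "connect", "connect", "connecti", "walk", "walk"]

accuracy = stemming_accuracy(predicted, gold)
icf = index_compression_factor(words, predicted)
print(f"accuracy={accuracy:.2%}, compression={icf:.2%}")
```

The two numbers measure different things: a stemmer can compress the vocabulary heavily while still producing incorrect stems, so a high compression rate alone says nothing about correctness.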

Experimental Analysis of Stemming on Jurisprudential Documents Retrieval

Article

Full-text available

Jan 2018

Stemming algorithms are commonly used during the textual preprocessing phase to reduce data dimensionality. However, the efficacy of this reduction depends on the domain it is applied to. For instance, there are reports in the literature that show the effect of stemming when applied to dictionaries or news text collections. On the other hand, we have not found any studies analyzing the impact of stemming on Brazilian judicial jurisprudence, which is composed of decisions handed down by the judiciary and is a fundamental instrument for legal professionals. This work therefore presents two complete experiments, showing the results obtained through the analysis and evaluation of stemmers applied to real jurisprudential documents originating from the Court of Justice of the State of Sergipe. In the first experiment, the results showed that, among the analyzed algorithms, the RSLP (Removedor de Sufixos da Lingua Portuguesa) had the greatest capacity for dimensionality reduction. In the second, which evaluated the stemming algorithms on legal document retrieval, the less aggressive stemmers RSLP-S (Removedor de Sufixos da Lingua Portuguesa Singular) and UniNE (University of Neuchâtel) presented the best cost-benefit ratio, since they reduced the dimensionality of the data and increased the effectiveness of the information retrieval evaluation metrics in one of the analyzed collections.

Unsupervised Joint PoS Tagging and Stemming for Agglutinative Languages

Article

Jan 2019

The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems along with PoS tags simultaneously. Therefore, we aim to overcome the sparsity problem by reducing word forms into their stems. We adopt a Bayesian model that is fully unsupervised. We build a Hidden Markov Model for PoS tagging where the stems are emitted through hidden states. Several versions of the model are introduced in order to observe the effects of different dependencies throughout the corpus, such as the dependency between stems and PoS tags or between PoS tags and affixes. Additionally, we use neural word embeddings to estimate the semantic similarity between the word form and stem. We use the semantic similarity as prior information to discover the actual stem of a word since inflection does not change the meaning of a word. We compare our models with other unsupervised stemming and PoS tagging models on Turkish, Hungarian, Finnish, Basque, and English. The results show that a joint model for PoS tagging and stemming improves on an independent PoS tagger and stemmer in agglutinative languages.

Enhancement of Detecting Wicked Website Through Intelligent Methods

Chapter

Sep 2016

Tarik A. Rashid

Notably, wicked websites contain different types of information that can threaten all web users, such as incitement to hack sites and encouragement to spread such notions by teaching the theft of networks, Wi-Fi, websites, internet forums, Facebook, email accounts, etc. The proposed work aims to protect sites from hacking by designing a method that takes full advantage of the capabilities of machine learning and intelligent systems to recognize informative content. The ultimate goal of this research is to understand the system behavior and determine the best solution to secure vulnerable users, the state, and society using Random Forest (RF) and Support Vector Machine (SVM) methods instead of traditional methods. Random Forest exhibited promising results in terms of accuracy.

The Rule-Based Sundanese Stemmer

Article

Jul 2018

Our research proposes an iterative Sundanese stemmer that removes derivational affixes before inflectional ones. This scheme was chosen because, in Sundanese affixation, a confix (a type of derivational affix) is applied in the last phase of the morphological process. Moreover, most Sundanese affixes are derivational, so removing the derivational affix as the first step is reasonable. To handle ambiguity, the last recognized affix was returned as the result. As the baseline, a Confix-Stripping Approach that applies the Porter Stemmer for the Indonesian language was used. This stemmer shares similarities in terms of affix type, but uses a different stemming order. To observe whether the baseline stems Sundanese affixed words properly, some features that were not covered by the baseline, such as infix and allomorph removal, were added. The evaluation was done using 4,453 unique affixed words collected from Sundanese online magazines. The experiment shows that, as a whole, our stemmer outperforms the modified baseline in terms of recognized affix type accuracy and properly stemmed affixed words. Our stemmer recognized 68.87% of the Sundanese affix types and correctly stemmed 96.79% of the affixed words; the modified baseline achieved 21.70% and 71.59%, respectively.

A Cognitive Inspired Unsupervised Language-Independent Text Stemmer for Information Retrieval

Article

Jul 2018
COGN SYST RES

In Information Retrieval systems, stemming handles words that can occur in different morphological forms, and hence matches document and query terms that are related in meaning. In this article, we propose a cognitively inspired, language-independent stemmer that learns groups of morphologically related words from the ambient corpus without any linguistic knowledge or human intervention, behaving in the way the human brain works. The main idea of our proposed algorithm is to determine only those variants of the words from the ambient corpus that match the original intent of the query terms. We conducted ad-hoc retrieval experiments in a number of languages of varying morphological complexity using standard TREC, FIRE, and CLEF document collections. The results indicate that stemming improves retrieval accuracy and that the effectiveness of the stemming algorithm increases with the morphological complexity of the language. The results also indicate that our proposed algorithm performs better than stemmers based on linguistic knowledge and other state-of-the-art statistical stemmers in almost all the languages under study. In a multilingual setup, these results are quite encouraging.

Templatic morphology as an emergent property: Roots and functional heads in Hebrew

Article

Jul 2018

Itamar Kastner

Modern Hebrew exhibits a non-concatenative morphology of consonantal “roots” and melodic “templates” that is typical of Semitic languages. Even though this kind of non-concatenative morphology is well known, it is only partly understood. In particular, theories differ in what counts as a morpheme: the root, the template, both, or neither. Accordingly, theories differ as to what representations learners must posit and what processes generate the eventual surface forms. In this paper I present a theory of morphology and allomorphy that combines lexical roots with syntactic functional heads, improving on previous analyses of root-and-pattern morphology. Verbal templates are here argued to emerge from the combination of syntactic elements, constrained by the general phonology of the language, rather than from some inherent difference between Semitic morphology and that of other languages. This way of generating morphological structure fleshes out a theory of morphophonological alternations that are non-adjacent on the surface but are local underlyingly; with these tools it is possible to identify where lexical exceptionality shows its effects and how it is reined in by the grammar. The Semitic root is thus analogous to lexical roots in other languages, storing idiosyncratic phonological and semantic information but respecting the syntactic structure in which it is embedded.

Improving Kurdish Web Mining through Tree Data Structure and Porter’s Stemmer Algorithms

Article

Jul 2018

Stemming is one of the most important preprocessing techniques for enhancing the accuracy of text classification. The key purpose of stemming is to conflate words that share the same stem, which decreases the high dimensionality of the feature space. A smaller feature space reduces the time needed to construct a model and the memory it requires. In this paper, a new stemming approach is explored for enhancing Kurdish text classification performance. Tree data structure and Porter’s stemmer algorithms are combined to build the proposed approach. The system is assessed using Support Vector Machine (SVM) and Decision Tree (C4.5) classifiers to illustrate performance before and after applying the suggested stemmer. Furthermore, the usefulness of using stop words is considered before and after implementing the suggested approach.

A Study of Graph Based Stemmer in Arabic Extrinsic Plagiarism Detection

Conference Paper

Mar 2018

Arabic stemming as a technique of Natural Language Processing is becoming an increasingly significant research domain, since Arabic is one of the most challenging languages. In this study, a new graph-based approach for stemming Arabic documents is proposed. Moreover, the impact of this stemmer on extrinsic plagiarism detection is evaluated. In this approach, a word is represented by a directed weighted graph having a set of connected components. Each of these components has a specific representation. A stem is then selected by comparing the word's representation with a database of 450 stems. The results obtained show that this stemmer improves the detection of extrinsic plagiarism.

Recommended publications

A comparative review of Urdu stemmers: Approaches and challenges

An improved Urdu stemming algorithm for text mining based on multi-step hybrid approach

High performance stemming algorithm to handle multi-level inflections in Urdu Language