Query Expansion in Information Retrieval for Urdu Language

Imran Rasheed
Department of CSE
Indian Institute of Technology (ISM)
Dhanbad, India
imranrasheed@cse.ism.ac.in
Haider Banka
Department of CSE
Indian Institute of Technology (ISM)
Dhanbad, India
haider.banka@gmail.com
Abstract—Information retrieval systems need to be upgraded constantly to meet the challenges posed by increasingly advanced user queries as search systems grow more sophisticated over time. These problems have been addressed extensively in recent times by several research communities in order to achieve fast and relevant results. One such approach is to augment the query: automatic query expansion increases precision in information retrieval, even though it can degrade the results for some queries. Here, this approach is tested on the present Urdu data collection using different expansion models, namely KL, Bo1 and Bo2. The collection is quite large compared with other existing Urdu datasets: it comprises 85,304 documents in the TREC scheme and 52 topics with their relevance assessments. In this paper we focus on enhancing the retrieval model through query expansion, which has never before been attempted on Urdu text. We show that a deep analysis of the initial and expanded queries yields interesting insights that could benefit future research in the domain.
Keywords—Urdu Information Retrieval; Query Expansion; Relevance Feedback; Local Analysis; Urdu Corpus; Relevance Judgment.
I. INTRODUCTION
In the last fifty years, massive progress has been reported in the Information Retrieval (IR) area, but most of it has been carried out in English [1]. Besides this, a number of IR communities (such as TREC, CLEF, NTCIR and FIRE) have taken notable initiatives in several East Asian, European and South Asian languages. However, there are still a number of languages where little progress has been reported compared with the aforementioned ones, and Urdu is one of them; its poor coverage is mainly attributed to the lack of available linguistic resources. It is remarkable that in the last decade ample work has been reported on different South Asian languages, mostly in the areas of machine learning and natural language processing, but Urdu is still quite distant from all such initiatives [2]. Additionally, for any linguistic study, a benchmark dataset of the given language is a basic necessity for performing advanced research experiments. A few large news-genre datasets of general interest have been constructed for English and other European and Asian languages and are publicly available [3, 4]. On the other hand, the available standard data for Urdu is scarce and limited to specific domains. For example, the Becker-Riaz corpus contains about 7,000 short news articles gathered from BBC News [5], the EMILLE corpus includes 1,640,000 words of Urdu text and 512,000 words of spoken Urdu [6], the IJCNLP-2008 NE tagged corpus has 40,000 words for named entity recognition [7], and the CLE datasets have 100,000 words used for POS tagging and named entity recognition [8]; these are the only Urdu corpora freely available to the public. All these datasets are comparatively small and are not suitable for accomplishing various Urdu information retrieval tasks such as text summarization, text clustering, ad-hoc information retrieval, question answering, query expansion, categorization, etc. Recently, the authors of [9] developed an NE-tagged dataset used only for NER analysis. The present work is therefore undertaken to build a substantial Urdu collection that can readily be used later for different linguistic studies. In [10], the authors built an efficient QESBIRM platform that combines QE and proximity SBIRM approaches to significantly enhance retrieval efficiency; the SBIRM method is implemented using either DWT, KLD or P-WNET. In [11], the authors give an overview of information retrieval models based on query expansion, including some explanation of the practical work undertaken and its methods of implementation. The authors of [12] investigated a combined approach of association-based and distribution-based term selection to improve overall retrieval effectiveness. Elsewhere, the authors of [13] reported positive results in text retrieval using WordNet-based query expansion. More recently, the authors of [11] carried out three types of spectral analysis based on semantic segmentation, namely sentence-based, paragraph-based and fixed-length segmentation, to improve the ranking score as well as the run-time efficiency of query resolution while maintaining a reasonable index size.
Our Urdu text collection consists of 85,304 newswire documents of general interest, compiled according to the Text Retrieval Conference (TREC) standard, and also includes 52 topics from different categories together with their relevance judgments. Here, we examine the effectiveness of different query expansion (QE) models, namely KL, Bo1 and Bo2, on this collection. The paper is organized as follows: Section II describes some challenges of the Urdu language. Section III outlines related work and state-of-the-art approaches to query expansion, Section IV gives a brief summary of the collection statistics, and Section V reviews the query expansion models. Sections VI and VII describe the experiments carried out to investigate retrieval effectiveness using query expansion. The retrieval results are discussed in Section VIII. Section IX concludes the paper and suggests avenues for future work.

2018 Fourth International Conference on Information Retrieval and Knowledge Management
978-1-5386-3812-5/18/$31.00 ©2018 IEEE
II. THE URDU LANGUAGE
Urdu, the national language of Pakistan, is spoken by well over 300 million people worldwide [7, 14], including a large proportion of speakers from the Indian subcontinent. It has been widely adopted by Bollywood, the famous Indian cinema, which introduced it to large non-Urdu-speaking populations in the Middle East, Africa and several European and Latin American regions. Urdu is very similar to Hindi but distinct in its written script; both languages share the common vocabulary and grammar of everyday speech [15, 16]. Urdu morphology, orthography, word tokenization, word spacing and resource scarcity are among the challenges that make it one of the more difficult languages following the Arabic script. It therefore offers a unique and significant attraction to the linguistic research community.
A. Orthography
The Urdu script is typically written from right to left, like the Arabic script, in the Nastaliq style [4]. Most characters acquire different shapes depending on their position in the ligature; e.g., a letter may appear differently when occurring at the beginning, end or centre of a word, or when appearing in isolation [17].
B. Morphology
Morphology is the study of word formation [18]. Urdu is morphologically rich, which means that multiple words can be derived from a single root [19]. Table I shows a few of the variants that exist for a single word in Urdu.
TABLE I. DIFFERENT VARIANTS OF THE WORD "PLAY"

[Urdu inflected forms of the verb "play" ("Play", "You play", etc.) for different pronouns and tenses]
C. Tokenization and Word Segmentation
Tokenization is the process of breaking a text into word-level (or root-level) units. Unlike English text, where two distinct words are generally separated by a space, Urdu faces both tokenization and word-spacing problems [20]. Moreover, the use of the space character in Urdu is quite problematic from an information retrieval perspective: a space is not inserted after every single word; instead, it is often used only after a word has been joined with an appropriate suffix. Queries can therefore produce unwanted results, leaving considerable room for researchers to overhaul the present tokenization issues in the Urdu language.
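The space-omission issue can be illustrated with a toy whitespace tokenizer. The Latin transliterations below stand in for Urdu script and are illustrative assumptions only, not corpus data:

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Naive tokenizer: split on whitespace only, as a baseline IR
    pipeline would for space-delimited languages."""
    return text.split()

# Two logical words written without an intervening space, as happens in
# Urdu when a stem is fused with its suffix or auxiliary (transliterated):
sentence = "woh khelraha hai"      # intended reading: "woh khel raha hai"
tokens = whitespace_tokenize(sentence)
print(tokens)                      # ['woh', 'khelraha', 'hai']
assert "khelraha" in tokens        # the fused pair surfaces as one token
```

A query containing only the stem would therefore fail to match the fused token, which is exactly the retrieval mismatch described above.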
D. Diction Problem
Diction-related issues arise when a word must be chosen from a set of words having the same meaning. The Urdu language is full of such diction problems, which adds to its computational complexity. For example, some Urdu words differ in spelling while sharing the same meaning, such as the two accepted Urdu spellings of "Ground".
E. Loan Words
The Urdu language contains many loan words from foreign languages, including English, Arabic, Persian and Turkish. Urdu is a relatively young language compared with historically rich languages such as Arabic, Persian and Turkish; it arose mainly to bridge the gap between speakers of those languages and speakers of native North Indian languages, mainly Hindi. Over the course of time, Urdu absorbed many words from other languages, especially English, which are routinely used not only in spoken but also in written Urdu. Examples of words borrowed from English include Promotion, Police and Company.
III. RELATED WORK
Queries given by users to an information retrieval system are usually short, ambiguous and imperfect [21], and therefore often yield inaccurate responses. Many essential terms may be absent from a given query, which can lead a search engine or information retrieval system to respond poorly or ineffectively, returning documents of little relevance. This problem was first addressed by Rocchio [22], who proposed a relevance feedback mechanism that automatically expands the original query by adding extra terms such as synonyms, plurals and modifiers, based on user feedback or query reformulation [23, 24].
Query expansion is a widely practiced methodology for significantly enhancing the user's experience of retrieving the preferred information [25-28]. It is essentially the process of adding significant, contextually related words to reformulate the seed query and thereby enhance retrieval performance. Several approaches to query expansion have been proposed, but among them pseudo-relevance feedback (PRF), also called blind relevance feedback, is observed to be one of the most effective and useful forms of automatic query expansion [27]. In this approach, the original query is first issued to a standard information retrieval system [29-31], and related terms are then extracted to expand the seed query: the top-K documents returned in the first retrieval pass may contain important terms that help separate relevant documents from irrelevant ones. In general, the expansion terms are chosen either as the most frequent terms in the feedback documents or as the terms most specific to the feedback documents relative to the entire collection. Fig. 1 describes the typical workflow of a PRF-based information retrieval system. Several QE methods have been devised by researchers; in this work, three QE models, namely Bo1, Bo2 and KL, are adopted for the analysis. Furthermore, a comparative study of the retrieval effectiveness of state-of-the-art retrieval models on Urdu is presented, and the effectiveness of applying QE to improve retrieval accuracy is explored. Terrier is used as the information retrieval framework for all experiments undertaken in this work [32].
Fig. 1. Architecture of Pseudo Relevance Feedback based System.
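The feedback loop of Fig. 1 can be sketched in a few lines. This is a minimal illustration of PRF with simple frequency-based term selection, not the DFR weighting Terrier actually applies (covered in Section V), and the toy documents are invented for the example:

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, k=10, n_expansion=5):
    """One round of PRF: take the top-k documents returned for the
    original query, score candidate terms by their frequency in that
    feedback set, and append the best scorers to the query."""
    feedback = ranked_docs[:k]
    counts = Counter()
    for doc in feedback:
        counts.update(doc)
    # drop terms already in the query, keep the n most frequent
    candidates = [(t, c) for t, c in counts.most_common() if t not in query_terms]
    expansion = [t for t, _ in candidates[:n_expansion]]
    return list(query_terms) + expansion

# Toy first-pass ranking: each document is a list of tokens.
docs = [["cricket", "ranking", "test"],
        ["cricket", "batsman", "ranking"],
        ["weather", "report"]]
print(pseudo_relevance_feedback(["cricket"], docs, k=2, n_expansion=2))
# ['cricket', 'ranking', 'test']
```

The expanded query is then re-issued against the index, as the second pass in the figure.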
IV. DOCUMENT COLLECTION
A. Documents, Topics, and Relevance Assessments
The efficiency of the different query expansion models described in this paper is measured on our own newly developed Urdu dataset. The collection consists of 85,304 documents based on the TREC specification, all encoded in UTF-8. A sample document in the standard TREC format is shown in Fig. 2.
<DOC>
<DOCNO>26_July_2012_Sportz4</DOCNO>
<TITLE>[Urdu headline]</TITLE>
<TEXT>
[Urdu article body]
</TEXT>
</DOC>
Fig. 2. A sample file in TREC format.
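Records in this format can be read with a small regular-expression parser. The sketch below assumes well-formed <DOC> blocks containing exactly the DOCNO, TITLE and TEXT fields shown in Fig. 2; a production reader would need to be more defensive:

```python
import re

def parse_trec_docs(raw: str):
    """Minimal parser for TREC-style <DOC> records as in Fig. 2."""
    docs = []
    for block in re.findall(r"<DOC>(.*?)</DOC>", raw, re.S):
        def field(tag):
            return re.search(rf"<{tag}>(.*?)</{tag}>", block, re.S).group(1).strip()
        docs.append({
            "docno": field("DOCNO"),
            "title": field("TITLE"),
            "text": field("TEXT"),
        })
    return docs

sample = "<DOC><DOCNO>26_July_2012_Sportz4</DOCNO><TITLE>t</TITLE><TEXT>body</TEXT></DOC>"
print(parse_trec_docs(sample)[0]["docno"])   # 26_July_2012_Sportz4
```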
A topic follows the standards common to other retrieval initiatives such as TREC. An ordinary topic has three sections: the title, the description and the narrative. A unique identification number is assigned to each topic to distinguish it from other, similar topics. One such example is given in Fig. 3. A set of 52 topics with their relevance assessments has been used for the analysis of query expansion.
<topic>
<topid>10</topid>
<title>[Urdu title]</title>
<desc>[Urdu description]</desc>
<narr>[Urdu narrative]</narr>
</topic>
Fig. 3. A sample topic (Topic 2, "Aadarsh Housing Scams").
V. QUERY EXPANSION MODELS
In the present analysis, DFR-based term weighting models, namely Bo1 [33], Bo2 [33] and KL, are employed using the Terrier search engine. Terrier implements a Divergence From Randomness (DFR) based QE mechanism that generalizes Rocchio's method [26]. First, the DFR model measures the weight of each term in the top-ranked documents. The most informative terms are then collected from the returned results and added to the original query to generate the expanded query. The weighting models are described in Sections V-A, V-B and V-C [34].
A. Kullback-Leibler (KL) Model
The Kullback-Leibler divergence computes the divergence between the probability distributions of terms in the whole collection and in the top-ranked documents returned by the first-pass retrieval for the original user query [35]. For a term t, this divergence is given by

\[ w(t) = P_n(t) \cdot \log_2\frac{P_n(t)}{P_c(t)} \tag{1} \]

\[ P_n(t) = \frac{\sum_{d \in n} tf_{t,d}}{\sum_{d \in n}\sum_{t' \in d} tf_{t',d}} \tag{2} \]

\[ P_c(t) = \frac{\sum_{d \in C} tf_{t,d}}{\sum_{d \in C}\sum_{t' \in d} tf_{t',d}} \tag{3} \]

where \(P_n(t)\) is the probability of the term t in the top-ranked documents n, and \(P_c(t)\) is the probability of the term t in the whole collection C.
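Under these definitions, the per-term KL weight reduces to a few arithmetic operations. The counts below are invented toy numbers:

```python
import math

def kl_weight(tf_top: int, tokens_top: int, tf_coll: int, tokens_coll: int) -> float:
    """KL expansion weight of one term (Eqs. 1-3).
    p_n: probability of the term in the top-ranked (feedback) documents;
    p_c: probability of the term in the whole collection.
    The weight is positive when the term is more concentrated in the
    feedback documents than in the collection as a whole."""
    p_n = tf_top / tokens_top
    p_c = tf_coll / tokens_coll
    return p_n * math.log2(p_n / p_c)

# Toy counts: the term occurs 12 times in 1,000 feedback tokens but only
# 50 times in a 100,000-token collection, so it is a good expansion term.
print(round(kl_weight(12, 1000, 50, 100000), 4))   # 0.055
```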
B. Bose-Einstein1 (Bo1) Model
This model is based on Bose-Einstein statistics; the weight of a term t in the top-ranked documents (the number of feedback documents typically ranging from 3 to 10) is given by [36]

\[ w(t) = \sum_{d \in n} tf_{t,d} \cdot \log_2\frac{1+P_n}{P_n} + \log_2\left(1+P_n\right) \tag{4} \]

\[ P_n = \frac{F}{N} \tag{5} \]

where F is the frequency of the term t in the whole collection and N is the number of documents in the collection, so that Eq. (5) denotes the average term frequency of t in the collection.
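Equation (4) can be computed directly once the collection statistics are known. The numbers below are invented for illustration (only N matches the size of the present collection):

```python
import math

def bo1_weight(tf_x: int, F: int, N: int) -> float:
    """Bo1 expansion weight (Eqs. 4-5).
    tf_x: frequency of the term in the top-ranked documents;
    F:    frequency of the term in the whole collection;
    N:    number of documents in the collection, so P_n = F / N is the
          average frequency of the term per document."""
    p_n = F / N
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)

# A term seen 8 times in the feedback documents but only 200 times in an
# 85,304-document collection receives a high positive weight:
w = bo1_weight(tf_x=8, F=200, N=85304)
print(w > 0)   # True
```

Rare terms that are frequent in the feedback set dominate, which is the intended behaviour of the model.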
C. Bose-Einstein2 (Bo2) Model
The scoring formula of Bo2 is given by

\[ w(t) = tf_x \cdot \log_2\frac{1+P_f}{P_f} + \log_2\left(1+P_f\right) \tag{6} \]

where:
- \(tf_x\) is the frequency of the term in the top-returned documents;
- \(P_f = F \cdot l_x / token_c\) is the expected frequency of the term t in the top-ranked documents, where F is the frequency of the query term in the whole collection, \(l_x\) is the total length in tokens of the exp_doc top-ranked documents (exp_doc being a parameter of the query expansion methodology), and \(token_c\) is the total number of tokens in the collection.
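A direct transcription of Eq. (6), with the quantities named as in the definitions above; the counts are toy values chosen so that the expected frequency works out to exactly 1:

```python
import math

def bo2_weight(tf_x: int, F: int, l_x: int, token_c: int) -> float:
    """Bo2 expansion weight (Eq. 6).
    tf_x:    frequency of the term in the top-returned documents;
    F:       frequency of the term in the whole collection;
    l_x:     total length in tokens of the top-ranked documents;
    token_c: total number of tokens in the collection.
    P_f = F * l_x / token_c is the frequency the term would be expected
    to have in the l_x feedback tokens under a random distribution."""
    p_f = F * l_x / token_c
    return tf_x * math.log2((1 + p_f) / p_f) + math.log2(1 + p_f)

# With F=200, l_x=5,000 and token_c=1,000,000, P_f = 1, so the weight
# is tf_x * log2(2) + log2(2) = tf_x + 1:
print(bo2_weight(8, 200, 5000, 1000000))   # 9.0
```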
VI. EXPERIMENT ARCHITECTURE
In this section, the performance on the Urdu collection under different query expansion methods is measured. First, retrieval with the Okapi BM25 model [22] as the weighting scheme was carried out using only the title field of the original queries. This run serves as the benchmark against which the results obtained by expanding each topic with related concepts are compared, as shown in Table II. The expanded terms assist in matching relevant documents to the associated query and help reduce the vocabulary mismatch between documents and queries [37]. Several experiments were performed using the set of 52 queries with the Okapi BM25 weighting scheme. Here, Rocchio beta = 0.4 and the retrieval parameter b = 0.4 gave the best results for query expansion. The evaluation was performed with the Terrier information retrieval framework, which has been found quite effective for indexing, retrieval and evaluation of English and non-English documents. To evaluate the results of the retrieval process, trec_eval, the evaluation program of the TREC conference, was used. trec_eval reports measures such as the total numbers of retrieved, relevant and relevant-retrieved (rel_ret) documents over all queries, as well as MAP, R-precision and interpolated recall-precision averages. For evaluating retrieval performance, the mean average precision (MAP) measure is used in this experiment; its value is computed over (at most) 100 retrieved documents per query.
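The MAP figure reported by trec_eval can be reproduced by hand for a toy run. The rankings and relevance sets below are invented, and the cutoff mirrors the 100-document limit used in this experiment:

```python
def average_precision(ranked, relevant, cutoff=100):
    """AP for one query: mean of precision@r at each rank r holding a
    relevant document, over the first `cutoff` results, normalised by
    the number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over all queries; `runs` maps query-id -> (ranking, relevant-set)."""
    aps = [average_precision(r, rel) for r, rel in runs.values()]
    return sum(aps) / len(aps)

runs = {
    "q1": (["d1", "d3", "d2"], {"d1", "d2"}),   # AP = (1/1 + 2/3) / 2
    "q2": (["d5", "d4"], {"d4"}),               # AP = 1/2
}
print(round(mean_average_precision(runs), 4))   # 0.6667
```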
TABLE II. MAP, R-PRECISION AND P@K OF BM25

Okapi BM25
Mean Average Precision   0.3162
R-precision              0.2990
P@10                     0.3308
P@20                     0.2702
P@100                    0.1360
VII. EXPERIMENTS FOR QUERY EXPANSION WITH BO1, BO2 & KL MODELS
In these experiments, Rocchio's approach is adopted to enhance the Okapi BM25 retrieval method. Initially, the top 5, 10 or 15 retrieved documents were chosen, and from each such feedback set 5, 10, 15, 30 or 50 terms were extracted. These terms were then added to the original queries to examine whether the results differ significantly. The top 100 documents are first retrieved using the baseline retrieval model; the model is then augmented with the Bo1, Bo2 and KL expansion models and analyzed with the MAP measure to assess further improvement in the results. Table III shows the results obtained after query expansion.
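The grid just described (3 document cuts x 5 term counts x 3 models) can be driven by a small sweep harness. Here evaluate_run is a hypothetical stand-in for the actual index-expand-retrieve-score pipeline:

```python
from itertools import product

def sweep(evaluate_run):
    """Run every (model, feedback-documents, expansion-terms) setting
    from Section VII and collect one score per configuration."""
    results = {}
    for model, n_docs, n_terms in product(
            ["Bo1", "Bo2", "KL"], [5, 10, 15], [5, 10, 15, 30, 50]):
        results[(model, n_docs, n_terms)] = evaluate_run(model, n_docs, n_terms)
    return results

# Example with a dummy evaluator, just to inspect the sweep shape:
grid = sweep(lambda m, d, t: 0.0)
print(len(grid))   # 3 models x 3 document cuts x 5 term counts = 45 runs
```

Each cell of Table III corresponds to one entry of this grid.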
VIII. RESULTS AND DISCUSSIONS
In this section, the baseline text retrieval is compared with the expansion variants discussed in Section V. Table III presents the results obtained on the Urdu dataset. The highest MAP is achieved by BM25 (b = 1.0) enhanced by the Bo1 model, with an overall improvement of +25.55% over the baseline retrieval method when the numbers of feedback documents and selected terms are 10 and 15, respectively. Likewise, the highest MAP for BM25 (b = 1.0) enhanced with the KL expansion model shows a +25.68% improvement over the baseline when the number of documents is 10 and the number of selected terms is 5. Moreover, in all the cases examined, the Bo1 and KL models performed almost identically, whereas the results obtained with Bo2 are not appreciable, because fewer relevant documents are retrieved owing to term mismatch between the original query terms and the candidate expansion terms.
IX. CONCLUSION AND FUTURE WORK
The focus of the present work is to observe the effect of query expansion on the given Urdu dataset, which has not been addressed before at such a scale. The results show that the KL model performed extremely well in comparison with the other expansion models, Bo1 and Bo2, on the present Urdu data collection. In addition, the Kullback-Leibler model was found to significantly enhance the MAP of the retrieved results, by almost 22-24% in the above study. In future, an external resource such as a WordNet-based approach to the Bo2 model will be investigated to enhance the efficiency of Urdu information retrieval.
ACKNOWLEDGMENT
We sincerely thank Mr. Zahoor Ahmad Shora, chief editor of `Daily Roshni', for his generous contribution in freely sharing the raw data for the collection, and we also thank Mr. Hamaid Mehmood for his guidance and kind support.
REFERENCES
[1] A. Singhal, “Modern information retrieval: A brief overview,” IEEE
Data Eng. Bull., vol. 24, no. 4, pp. 35–43, 2001.
[2] P. Majumdar, M. Mitra, S. K. Parui, and P. Bhattacharya, “Initiative for
indian language ir evaluation,” 2007.
[3] E. Darrudi, M. R. Hejazi, and F. Oroumchian, “Assessment of a modern
farsi corpus,” in Proceedings of the 2nd Workshop on Information
Technology & its Disciplines (WITID), 2004.
[4] A. Daud, W. Khan, and D. Che, “Urdu language processing: a survey,”
Artificial Intelligence Review, pp. 1–33, 2016.
[5] D. Becker and K. Riaz, “A study in urdu corpus construction,” in
Proceedings of the 3rd workshop on Asian language resources and
international standardization-Volume 12. Association for Computational
Linguistics, 2002, pp. 1–5.
[6] A. Hardie, “Developing a tagset for automated part-of-speech tagging in
urdu.” in Corpus Linguistics 2003, 2003.
[7] S. Hussain, “Resources for urdu language processing.” in IJCNLP, 2008,
pp. 99–100.
[8] S. Urooj, S. Hussain, F. Adeeba, F. Jabeen, and R. Parveen, “Cle urdu
digest corpus,” LANGUAGE & TECHNOLOGY, vol. 47, 2012.
[9] W. Khana, A. Daudb, J. A. Nasira, and T. Amjada, “Named entity
dataset for urdu named entity recognition task,” Organization, vol. 48, p.
282.
[10] S. Alnofaie, M. Dahab, and M. Kamal, “A novel information retrieval
approach using query expansion and spectral-based,” information
retrieval, vol. 7, no. 9, 2016.
[11] M. Y. Dahab, M. Kamel, and S. Alnofaie, “Further investigations for
documents information retrieval based on dwt,” in International
Conference on Advanced Intelligent Systems and Informatics. Springer,
2016, pp. 3–11.
[12] D Pal, M Mitra, and K Datta, “Query expansion using term distribution
and term association,” arXiv preprint arXiv:1303.0667, 2013.
[13] D Pal, M Mitra, K Datta, “Improving query expansion using wordnet,”
Journal of the Association for Information Science and Technology, vol.
65, no. 12, pp. 2469–2478, 2014.
[14] K. Riaz, “Baseline for urdu ir evaluation,” in Proceedings of the 2nd ACM
Workshop on Improving Non English Web Searching. ACM, 2008, pp. 97–
100.
[15] K Riaz, “Urdu is not hindi for information access,” in Workshop on
Multilingual Information Access, SIGIR, 2009.
[16] K Riaz, “Comparison of hindi and urdu in computational context,” Int J
Comput Linguist Nat Lang Process, vol. 1, no. 3, pp. 92–97, 2012.
[17] M. I. Razzak, “Online urdu character recognition in unconstrained
environment,” Ph.D. dissertation, International Islamic University,
Islamabad, 2011.
[18] V. Gupta, N. Joshi, and I. Mathur, “Design & development of rule based
inflectional and derivational urdu stemmer usal,” in Futuristic Trends on
Computational Analysis and Knowledge Management (ABLAZE), 2015
International Conference on. IEEE, 2015, pp. 7–12.
[19] S. Iqbal, M. W. Anwar, U. I. Bajwa, and Z. Rehman, “Urdu spell
checking: Reverse edit distance approach,” in Proceedings of the 4th
Workshop on South and Southeast Asian Natural Language Processing,
2013, pp. 58–65.
[20] S. Stymne, “Spell checking techniques for replacement of unknown
words and data cleaning for haitian creole sms translation,” in
Proceedings of the Sixth Workshop on Statistical Machine Translation.
Association for Computational Linguistics, 2011, pp. 470–477.
[21] A. Spink, D. Wolfram, M. B. Jansen, and T. Saracevic, “Searching the
web: The public and their queries,” Journal of the Association for
Information Science and Technology, vol. 52, no. 3, pp. 226–234, 2001.
[22] S. E. Robertson, “The probability ranking principle in ir,” Journal of
documentation, vol. 33, no. 4, pp. 294–304, 1977.
[23] F. Diaz, “Pseudo-query reformulation,” in European Conference on
Information Retrieval. Springer, 2016, pp. 521–532.
[24] G. Salton and C. Buckley, “Improving retrieval performance by
relevance feedback,” Readings in information retrieval, vol. 24, no. 5,
pp. 355–363, 1997.
[25] D. Metzler and W. B. Croft, “Latent concept expansion using markov
random fields,” in Proceedings of the 30th annual international ACM
SIGIR conference on Research and development in information
retrieval. ACM, 2007, pp. 311–318.
[26] J. J. Rocchio, “Relevance feedback in information retrieval,” The Smart
retrieval system-experiments in automatic document processing, 1971.
[27] J. Xu and W. B. Croft, “Query expansion using local and global
document analysis,” in ACM SIGIR Forum, vol. 51, no. 2. ACM, 2017,
pp. 168–175.
[28] C. Zhai and J. Lafferty, “Model-based feedback in the language
modeling approach to information retrieval,” in Proceedings of the tenth
international conference on Information and knowledge management.
ACM, 2001, pp. 403–410.
[29] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan,
“Searching the web,” ACM Transactions on Internet Technology
(TOIT), vol. 1, no. 1, pp. 2–43, 2001.
[30] R. Baeza-Yates and B. Ribeiro-Neto, “Modern information retrieval
addison-wesley longman,” Reading MA, 1999.
[31] I. H. Witten, A. Moffat, and T. C. Bell, Managing gigabytes:
compressing and indexing documents and images. Morgan Kaufmann,
1999.
[32] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma,
“Terrier: A high performance and scalable information retrieval
platform,” in Proceedings of the OSIR Workshop, 2006, pp. 18–25.
[33] G. Amati, “Probability models for information retrieval based on
divergence from randomness,” Ph.D. dissertation, University of
Glasgow, 2003.
[34] V. Plachouras, B. He, and I. Ounis, “University of glasgow at trec 2004:
Experiments in web, robust, and terabyte tracks with terrier.” in TREC,
2004.
[35] T. Cover and J. Thomas, “Elements of information theory wiley new
york,” NY Google Scholar, 1991.
[36] C. Macdonald, B. He, V. Plachouras, and I. Ounis, “University of
glasgow at trec 2005: Experiments in terabyte and enterprise tracks with
terrier.” in TREC, 2005.
[37] M. Shokouhi and J. Zobel, “Robust result merging using sample-based
score estimates,” ACM Transactions on Information Systems (TOIS),
vol. 27, no. 3, p. 14, 2009.
TABLE III. MAP OF DIFFERENT EXPANSION MODELS BASED ON ROCCHIO PSEUDO-RELEVANCE FEEDBACK

Okapi BM25 (a probabilistic retrieval model); MAP without PRF: 0.3162

Docs  Terms  BM25_Bo1           BM25_Bo2           BM25_KL
5     5      0.3867 (+22.30%)   0.2103 (-33.49%)   0.3884 (+22.83%)
5     10     0.3866 (+22.26%)   0.2090 (-33.90%)   0.3899 (+23.31%)
5     15     0.3899 (+23.31%)   0.2180 (-31.06%)   0.3902 (+23.40%)
5     30     0.3905 (+23.50%)   0.2453 (-22.42%)   0.3936 (+24.48%)
5     50     0.3893 (+23.12%)   0.2694 (-14.80%)   0.3937 (+24.51%)
10    5      0.3935 (+24.45%)   0.1772 (-43.96%)   0.3974 (+25.68%)
10    10     0.3941 (+24.64%)   0.1913 (-39.50%)   0.3951 (+24.95%)
10    15     0.3970 (+25.55%)   0.1939 (-38.68%)   0.3965 (+25.40%)
10    30     0.3931 (+24.32%)   0.2339 (-26.03%)   0.3967 (+25.46%)
10    50     0.3927 (+24.19%)   0.2642 (-16.45%)   0.3965 (+25.40%)
15    5      0.3958 (+25.17%)   0.1787 (-43.49%)   0.3915 (+23.81%)
15    10     0.3924 (+24.10%)   0.1826 (-42.25%)   0.3903 (+23.43%)
15    15     0.3966 (+25.43%)   0.1887 (-40.32%)   0.3965 (+25.40%)
15    30     0.3966 (+25.43%)   0.2263 (-28.43%)   0.3966 (+25.43%)
15    50     0.3914 (+23.78%)   0.2623 (-17.05%)   0.3923 (+24.07%)
... Extreme Programming (XP) adalah suatu proses pengembangan perangkat lunak yang menggunakan konsep berorientasi objek dan target dari metode ini ialah tim dengan skala kecil sampai dengan menengah (Supriyatna & Informatika, 2018). Query Expansion merupakan suatu metode yang digunakan untuk mengimprovisasi hasil dari pencarian dengan memperluas query dari pengguna untuk mendapatkan hasil penelusuran yang lebih baik (Rasheed & Banka, 2018). ...
Article
Full-text available
Technology is often used by the public to share and find information on the internet through social media, websites, and others. The internet allows people to access information regardless of time and place, wherever and whenever, but one of the negative impacts that often occurs with the presence of the internet is the spread of false information (hoax). Sometimes it is very difficult to distinguish whether the information is true or false (hoax). The impact of the spread of false information is unrest and division in society. This study aims to design and create a website based system that can be used to check whether the information spread is true or false (hoax) by using Single Page Application (SPA) technology with extreme programming as its development method. This research produces a system that can make it easier for people to distinguish between false and true information.
... Tokenization is the process of splitting strings in a given document into words known as tokens by using a tokenizer to read delimiters such as /-"[]():?<>! Characters [22]. ...
Article
Full-text available
Urdu is a widely spoken language in the Indian subcontinent with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared to those in other European and Asian languages. Therefore, by following Text Retrieval Conference standards, we attempted to construct an extensive text collection of 85 304 documents from diverse categories covering over 52 topics with relevance judgment sets at 100 pool depth. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other linguistic applications with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
Chapter
In modern era, due to several variations of user requirements, number of company and start-up increases rapidly. Each company has its own strategy and rules for maintaining company profit and loss. Market condition is one parameter for this situation. Sometime, different crisis or pandemic situation are raised in the society which become crucial for handling and managing. So, company manage their productivity and sales in chronological order that maintain the equilibrium based on customer requirements and market conditions. This chapter is based on conflicting strategy management technique for company using quadratic programming. In this chapter, quadratic programming plays the role of mathematical optimization based on desire objective function along with constraints. In this model, fuzzy logic is used to makes the quadratic programming flexible which is used to maintain variations of the customer requirements and demands efficiently. The proposed method simulated and validated in LINGO optimization software in terms of conflicting strategies of the company.
Chapter
In modern era, technology increases rapidly due to numerous requirements of the user or customer. There are various products and applications produced by the company with the context of requirement. One product is manufactured by several companies with some variants. So, several companies are competitor one to another. In this paper, an optimal solution is designed to minimize the losses of the company in uncertain environment. Here, uncertain environment indicates the environment that consists of several imprecise information. This information is created based on conflicting requirement of the users. So, in this paper, loss of company is minimized by reducing uncertainty. Quadratic programming is used to model the main objective and its related constraints in the form of nonlinear. In this model, decision variables are in the form of square. Fuzzy logic is used to reduce the imprecise information efficiently. The combination of both quadratic programming and fuzzy logic helps to model the main goal of the paper. Finally, the proposed method is formulated into LINGO optimization software to validate the main problem efficiently and effectively.
Article
Full-text available
Retrieving relevant documents from a large collection using only the original query is a formidable challenge. A generic way to improve retrieval is pseudo-relevance feedback, which expands the original query with conducive keywords so that the most relevant documents for the original query are returned. In this paper, five different hybrid techniques were tested using traditional query expansion methods. A boosting query term method was then proposed to reweigh and strengthen the original query. Query-wise analysis revealed that the proposed approach effectively identified the most relevant keywords, even for short queries. The potency of all the proposed methods was evaluated on three datasets: Roshni, Hamshahri1, and FIRE2011. Compared to traditional query expansion methods, the proposed methods improved the mean average precision on the Urdu, Persian, and English datasets by 14.02%, 9.93%, and 6.60%, respectively. The results were further established using analysis of variance and post-hoc analysis.
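The pseudo-relevance feedback loop described above can be sketched in a few lines: retrieve top documents for the original query, then add terms that are frequent in those documents but rare in the collection. The documents, ranking function, and weighting below are toy stand-ins, not the paper's boosting method.

```python
import math
from collections import Counter

# Toy pseudo-relevance feedback: expand a query with terms frequent in the
# top-ranked documents but rare in the collection. All data is hypothetical.
docs = [
    "urdu news corpus retrieval experiments",
    "query expansion improves urdu retrieval",
    "cricket match score report",
    "weather forecast rain tomorrow",
]

def rank(query, docs):
    """Rank documents by simple term overlap with the query."""
    q = set(query.split())
    return sorted(docs, key=lambda d: -len(q & set(d.split())))

def expand(query, docs, top_docs=2, add_terms=2):
    feedback = rank(query, docs)[:top_docs]
    tf = Counter(t for d in feedback for t in d.split())       # feedback freq
    df = Counter(t for d in docs for t in set(d.split()))      # document freq
    candidates = [t for t in tf if t not in query.split()]
    # Weight: feedback frequency discounted by collection frequency (idf-like).
    candidates.sort(key=lambda t: -tf[t] * math.log(len(docs) / df[t] + 1))
    return query.split() + candidates[:add_terms]

expanded = expand("urdu retrieval", docs)
```

The expanded query keeps the original terms first, which is also the natural hook for the reweighting ("boosting") of original terms that the paper proposes.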
Chapter
Full-text available
Feature selection plays a crucial role in text classification by minimizing the dimensionality of the features and accelerating the learning process of the classifier. Text classification is the process of dividing texts into categories based on their content and subject. Text classification techniques have been applied in domains such as medicine, politics, news, and law, showing that the adoption of domain-relevant features can improve classification performance. Despite plenty of research on classification in several languages across the world, there is a lack of such work in Urdu due to the shortage of existing resources. In this paper, we first present a hybrid feature selection approach (HFSA) for text classification of Urdu news articles. Second, we incorporate widely used filter selection approaches along with Latent Semantic Indexing (LSI) to extract the essential features of Urdu documents. The hybrid approach was tested with a Support Vector Machine (SVM) classifier on the Urdu "ROSHNI" dataset. The results were compared with those obtained by the individual filter feature selection methods and with a baseline feature selection method. The results show that the proposed approach achieves better classification, with promising accuracy and better efficiency.
Article
Full-text available
Named entity recognition (NER) and classification is a crucial task in Urdu. One challenge, among others, that makes Urdu NER complex is the non-availability of sufficient linguistic resources. NER research for English and other Western languages has a long tradition, and a significant amount of work has been done to solve NER problems in these languages; from the resource-availability perspective, Western languages count as resource-rich. Urdu, on the other hand, lags far behind in terms of resources. In this paper we report the development of an NE-tagged dataset for automated NER research in Urdu, especially from a machine learning (ML) perspective. The newly developed Urdu NER dataset contains about 48,000 words, comprising 4,621 named entities across seven named entity classes. The content source of the dataset is BBC Urdu, initially covering the sports, national, and international news domains. The dataset can be used for training and testing various statistical and machine learning models, such as the hidden Markov model (HMM), maximum entropy (ME), conditional random field (CRF), and recurrent neural network (RNN), for computational NER research in Urdu. Our goal is to make this new dataset freely and widely available, and to encourage other researchers to use it as a standard testbed for experimentation in Urdu NER research. In the rest of the paper, the new NER dataset is referred to as the UNER dataset.
Chapter
Full-text available
In most classical information retrieval models, documents are represented as bags of words that take into account term frequencies (tf) and inverse document frequencies (idf) while ignoring term proximity. Recently, proximity among query terms has been observed to improve document retrieval performance. Several retrieval applications determine term proximity at the query formulation level and rank documents based on the relative positions of the query terms within them. Such systems must store all proximity data in the index, leading to a large index that slows the search. Many recent models instead use a term signal representation, in which the query is transformed from the time domain to the frequency domain using transformation techniques such as the wavelet transform. The Discrete Wavelet Transform (DWT) uses a multi-resolution technique by which different frequencies are analyzed at different resolutions. The advantage of the DWT is that it considers the spatial information of the query terms within the document rather than only term counts. In this paper, in order to improve the ranking score, improve run-time efficiency in resolving the query, and maintain a reasonable index size, three types of spectral analysis based on semantic segmentation are carried out: sentence-based segmentation, paragraph-based segmentation, and fixed-length segmentation; different term weightings are also applied according to term position.
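The term-signal idea above is concrete enough to sketch: split a document into fixed-length segments, count the query term's occurrences per segment, and apply the Haar DWT so the coefficients capture both how often and where the term occurs. The segment counts below are invented toy data; the transform itself is the standard Haar decomposition.

```python
# Haar DWT of a term signal: a document is split into 8 fixed-length segments
# and signal[i] is the number of occurrences of a query term in segment i.

def haar_dwt(signal):
    """Full Haar decomposition (signal length must be a power of two).

    Returns [overall average] followed by detail coefficients from the
    coarsest to the finest resolution level.
    """
    details, s = [], list(signal)
    while len(s) > 1:
        avgs = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        diffs = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = diffs + details        # finer levels go to the back
        s = avgs
    return s + details

term_signal = [2, 0, 1, 0, 0, 3, 0, 0]   # toy per-segment term counts
coeffs = haar_dwt(term_signal)
```

The first coefficient is the overall term frequency per segment (the tf-like component), while the detail coefficients encode where in the document the occurrences cluster, which is the positional information the spectral model exploits.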
Article
Full-text available
Most information retrieval (IR) models rank documents by computing a score using only the lexicographical query terms or the frequency of the query terms in the document. These models are limited in that they consider neither term proximity within the document nor term mismatch. Term proximity is an important factor in determining how related a document is to the query. The ranking functions of the Spectral-Based Information Retrieval Model (SBIRM) consider both the frequency and the proximity of query terms by comparing the signals of the query terms in the spectral domain rather than the spatial domain, using the Discrete Wavelet Transform (DWT). Query expansion (QE) approaches overcome the word-mismatch problem by adding terms with related meanings to the query; they divide into a statistical approach, Kullback-Leibler divergence (KLD), and a semantic approach, P-WNET, which uses WordNet, and both enhance performance. Based on the foregoing considerations, the objective of this research is to build an efficient QESBIRM that combines QE with the proximity-aware SBIRM by implementing SBIRM using the DWT together with KLD or P-WNET. Experiments were conducted to test and evaluate QESBIRM using a Text Retrieval Conference (TREC) dataset. The results show that SBIRM with the KLD or P-WNET model outperforms the plain SBIRM in precision (P@), R-precision, Geometric Mean Average Precision (GMAP), and Mean Average Precision (MAP).
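The KLD side of the expansion described above scores a candidate term by how much its probability in the feedback documents diverges from its probability in the whole collection. A minimal sketch with invented toy documents (not the paper's TREC data):

```python
import math
from collections import Counter

# KLD-style scoring of candidate expansion terms: terms whose distribution in
# the feedback documents diverges most from the collection get the highest
# weight. Documents below are hypothetical toy data.
feedback_docs = ["query expansion urdu retrieval", "urdu retrieval evaluation"]
collection = feedback_docs + ["cricket score update", "weather rain report"]

def kld_scores(feedback_docs, collection):
    fb = Counter(t for d in feedback_docs for t in d.split())
    coll = Counter(t for d in collection for t in d.split())
    n_fb, n_coll = sum(fb.values()), sum(coll.values())
    # score(t) = p(t|feedback) * log( p(t|feedback) / p(t|collection) )
    return {t: (fb[t] / n_fb) * math.log((fb[t] / n_fb) / (coll[t] / n_coll))
            for t in fb}

scores = kld_scores(feedback_docs, collection)
top = max(scores, key=scores.get)
```

Terms that appear only in the feedback documents ("urdu", "retrieval") dominate the ranking, while terms spread evenly across the collection score near zero, which is the behaviour the KLD criterion is designed to produce.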
Article
Full-text available
Extensive work has been done on the various activities of natural language processing for Western languages compared with their Eastern counterparts, particularly South Asian languages. Western languages are termed resource-rich: core linguistic resources, e.g., corpora, WordNets, dictionaries, gazetteers, and associated tools, are customarily available for them. Most South Asian languages, by contrast, are low-resource. Urdu, a South Asian language among the most widely spoken in the subcontinent, is one such case, and due to this resource scarcity not enough work has been conducted for it. The core objective of this paper is to survey the linguistic resources that exist for Urdu language processing (ULP), to highlight the different tasks involved, and to discuss the available state-of-the-art techniques. Conclusively, this paper attempts to describe in detail the recent increase in interest and the progress made in ULP research. First, the available datasets for Urdu are discussed. Characteristics of the language, resource sharing between Hindi and Urdu, and Urdu orthography and morphology are then presented. Pre-processing activities such as stop-word removal, diacritic removal, normalization, and stemming are illustrated, followed by a review of state-of-the-art research on tokenization, sentence boundary detection, part-of-speech tagging, named entity recognition, parsing, and WordNet development. In addition, the impact of ULP on application areas such as information retrieval, classification, and plagiarism detection is investigated. Finally, open issues and future directions for this new and dynamic area of research are provided. The goal of this paper is to organize ULP work so as to provide a platform for future ULP research activities.
Article
Automatic query expansion has long been suggested as a technique for dealing with the fundamental issue of word mismatch in information retrieval. A number of approaches to expansion have been studied and, more recently, attention has focused on techniques that analyze the corpus to discover word relationships (global techniques) and those that analyze the documents retrieved by the initial query (local feedback). In this paper, we compare the effectiveness of these approaches and show that, although global analysis has some advantages, local analysis is generally more effective. We also show that global analysis techniques can be applied to the locally retrieved documents to combine the benefits of both approaches.
Conference Paper
Automatic query reformulation refers to rewriting a user’s original query in order to improve the ranking of retrieval results compared to the original query. We present a general framework for automatic query reformulation based on discrete optimization. Our approach, referred to as pseudo-query reformulation, treats automatic query reformulation as a search problem over the graph of unweighted queries linked by minimal transformations (e.g. term additions, deletions). This framework allows us to test existing performance prediction methods as heuristics for the graph search process. We demonstrate the effectiveness of the approach on several publicly available datasets.
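The graph-search formulation above can be sketched as a greedy hill-climb: neighbours of a query are single-term additions or deletions, and a performance predictor serves as the search heuristic. The vocabulary and the predictor below are hypothetical stand-ins (a real system would use a query performance prediction method, not a hand-built scorer).

```python
# Toy pseudo-query reformulation as greedy search over the query graph.
# Neighbours = queries one term-addition or term-deletion away; a hypothetical
# predictor scores each neighbour and we climb while the score improves.

vocab = {"urdu", "retrieval", "expansion", "corpus"}

def predict_quality(query):
    """Stand-in performance predictor rewarding a hypothetical ideal query."""
    ideal = {"urdu", "retrieval", "expansion"}
    q = set(query)
    return len(q & ideal) - len(q - ideal)

def neighbours(query):
    q = set(query)
    adds = [q | {t} for t in vocab - q]                # minimal additions
    dels = [q - {t} for t in q if len(q) > 1]          # minimal deletions
    return [frozenset(n) for n in adds + dels]

def reformulate(query):
    current = frozenset(query)
    while True:
        best = max(neighbours(current), key=predict_quality)
        if predict_quality(best) <= predict_quality(current):
            return current                             # local optimum reached
        current = best

final = reformulate({"urdu", "corpus"})
```

Each step both adds helpful terms and prunes harmful ones, so the search converges to the predictor's optimum; with a real (noisy) performance predictor, the framework's interest lies in how well such heuristics guide the same walk.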