Query Expansion in Information Retrieval for Urdu Language

Imran Rasheed
Department of CSE
Indian Institute of Technology (ISM)
Dhanbad, India
imranrasheed@cse.ism.ac.in
Haider Banka
Department of CSE
Indian Institute of Technology (ISM)
Dhanbad, India
haider.banka@gmail.com
Abstract—Information retrieval systems need to be upgraded constantly to meet the challenges posed by increasingly advanced user queries as search systems grow more sophisticated over time. These problems have been addressed extensively in recent times by several research communities in order to achieve fast and relevant results. One such approach is to augment the query: automatic query expansion increases precision in information retrieval, even though it can degrade the results for some queries. Here, this approach is tested on the present Urdu data collection using different expansion models, namely KL, Bo1 and Bo2. The collection is quite large compared with other existing Urdu datasets: it comprises 85,304 documents in the TREC scheme and 52 topics with their relevance assessments. In this paper we focus on enhancing the retrieval model through query expansion, which has never before been attempted on Urdu text. We show that a deep analysis of the initial and expanded queries yields interesting insights that could benefit future research in the domain.
Keywords—Urdu Information Retrieval; Query Expansion; Relevance Feedback; Local Analysis; Urdu Corpus; Relevance Judgment.
I. INTRODUCTION
In the last fifty years, massive progress has been reported in the Information Retrieval (IR) area, but most of it has been carried out in English [1]. Besides this, a number of IR communities (such as TREC, CLEF, NTCIR and FIRE) have taken notable initiatives in several East Asian, European and South Asian languages. However, there are still a number of languages where little progress has been reported compared with the aforementioned ones, and Urdu is one of them; its poor coverage is mainly attributed to the lack of available linguistic resources. It is remarkable that in the last decade ample work has been reported on different South Asian languages, mostly in the areas of machine learning and natural language processing, but Urdu is still quite distant from all such initiatives [2]. Additionally, for any linguistic study, a benchmark dataset of the given language is a basic necessity for performing advanced research experiments. A few large news-genre datasets of general interest have been constructed for English and other European and Asian languages and are publicly available [3, 4]. On the other hand, the available standard data for Urdu is scarce and limited to specific domains. For example, the Becker-Riaz corpus contains about 7,000 short news articles gathered from BBC News [5], the EMILLE corpus includes 1,640,000 words of Urdu text and 512,000 words of spoken Urdu [6], the IJCNLP-2008 NE tagged corpus has 40,000 words for named entity recognition [7], and the CLE datasets have 100,000 words used for POS tagging and named entity recognition [8]; these are the only Urdu corpora freely available to the public. All these datasets are comparatively small and are not suitable for accomplishing various Urdu information retrieval tasks such as text summarization, text clustering, ad-hoc information retrieval, question answering, query expansion, categorization, etc. Recently, the authors of [9] developed an NE-tagged dataset used only for NER analysis. The present work is therefore undertaken to build a substantial Urdu collection that can readily be used later for different linguistic studies. In [10], the authors built an efficient QESBIRM platform that combines QE and proximity SBIRM approaches to significantly enhance retrieval efficiency; the SBIRM method is implemented using either DWT, KLD or P-WNET. In [11], the authors give an overview of information retrieval models based on query expansion, including some explanation of the practical work undertaken and its methods of implementation. The authors of [12] investigated a combined approach of association-based and distribution-based term selection to improve overall retrieval effectiveness. Elsewhere, the authors of [13] reported positive results in text retrieval using WordNet-based query expansion. More recently, the authors of [11] carried out three types of spectral analysis based on semantic segmentation, namely sentence-based, paragraph-based and fixed-length segmentation, to improve the ranking score as well as the run-time efficiency of query resolution while maintaining a reasonable index size.
Our Urdu text collection consists of 85,304 newswire documents of general interest, compiled according to the Text Retrieval Conference (TREC) standard, and also includes 52 topics from different categories together with their relevance judgments. Here, we examine the effectiveness of different query expansion (QE) models, namely KL, Bo1 and Bo2, on this collection. The paper is organized as follows: Section II describes some challenges of the Urdu language. Section III outlines related work and state-of-the-art approaches to query expansion, Section IV gives a brief summary of the collection statistics, and Section V reviews the query expansion models. Sections VI and VII describe the experiments carried out to investigate retrieval effectiveness using query expansion. The retrieval results are discussed in Section VIII. Section IX concludes the paper and suggests avenues for future work.

2018 Fourth International Conference on Information Retrieval and Knowledge Management
978-1-5386-3812-5/18/$31.00 ©2018 IEEE
II. THE URDU LANGUAGE
Urdu, the national language of Pakistan, is spoken by well over 300 million people worldwide [7, 14], including a large proportion of speakers from the Indian subcontinent. It has been widely adopted by Bollywood, the famous Indian cinema, which introduced it to large non-Urdu-speaking populations in the Middle East, Africa and several European and Latin American regions. Urdu is very similar to Hindi but distinct in its written script; both languages share the common vocabulary and grammar of everyday speech [15, 16]. Urdu morphology, orthography, word tokenization, word spacing and resource scarcity are among the challenges that make it one of the more difficult languages following the Arabic script. It therefore offers a unique and significant attraction to the linguistic research community.
A. Orthography
The Urdu script is typically written from right to left, like the Arabic script, in the Nastaliq style [4]. Most characters acquire different shapes depending on their position in the ligature; e.g., a letter may appear differently when occurring at the beginning, end or centre of a word, or when appearing in isolation [17].
B. Morphology
Morphology is the study of word formation [18]. Urdu is morphologically rich, which means that multiple words can be derived from a single root [19]. Table I shows a few of the variants that exist for a single word in Urdu.
TABLE I. DIFFERENT VARIANTS OF THE WORD "PLAY"

[Urdu inflected forms of the verb "play" ("Play", "You play", etc.) for different pronouns and tenses]
C. Tokenization and Word Segmentation
Tokenization is the process of breaking a text into word-level (or root-level) units. Unlike English text, where two distinct words are generally separated by a space, Urdu faces both tokenization and word-spacing problems [20]. Moreover, the use of the space character in Urdu is quite problematic from an information retrieval perspective: a space is not inserted after every single word; instead, it is often used only after a word has been joined with an appropriate suffix. Queries can therefore produce unwanted results, leaving considerable room for researchers to overhaul the present tokenization issues in the Urdu language.
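The space-omission issue can be illustrated with a toy whitespace tokenizer. The Latin transliterations below stand in for Urdu script and are illustrative assumptions only, not corpus data:

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Naive tokenizer: split on whitespace only, as a baseline IR
    pipeline would for space-delimited languages."""
    return text.split()

# Two logical words written without an intervening space, as happens in
# Urdu when a stem is fused with its suffix or auxiliary (transliterated):
sentence = "woh khelraha hai"      # intended reading: "woh khel raha hai"
tokens = whitespace_tokenize(sentence)
print(tokens)                      # ['woh', 'khelraha', 'hai']
assert "khelraha" in tokens        # the fused pair surfaces as one token
```

A query containing only the stem would therefore fail to match the fused token, which is exactly the retrieval mismatch described above.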
D. Diction Problem
Diction-related issues arise when a word must be chosen from a set of words having the same meaning. The Urdu language is full of such diction problems, which adds to its computational complexity. For example, some Urdu words differ in spelling while sharing the same meaning, such as the two accepted Urdu spellings of "Ground".
E. Loan Words
The Urdu language contains many loan words from foreign languages, including English, Arabic, Persian and Turkish. Urdu is a relatively young language compared with historically rich languages such as Arabic, Persian and Turkish; it arose mainly to bridge the gap between speakers of those languages and speakers of native North Indian languages, mainly Hindi. Over the course of time, Urdu absorbed many words from other languages, especially English, which are routinely used not only in spoken but also in written Urdu. Examples of words borrowed from English include Promotion, Police and Company.
III. RELATED WORK
Queries given by users to an information retrieval system are usually short, ambiguous and imperfect [21], and therefore often yield inaccurate responses. Many essential terms may be absent from a given query, which can lead a search engine or information retrieval system to respond poorly or ineffectively, returning documents of little relevance. This problem was first addressed by Rocchio [22], who proposed a relevance feedback mechanism that automatically expands the original query by adding extra terms such as synonyms, plurals and modifiers, based on user feedback or query reformulation [23, 24].
Query expansion is a widely practiced methodology for significantly enhancing the user's experience of retrieving the preferred information [25-28]. It is essentially the process of adding significant, contextually related words to reformulate the seed query and thereby enhance retrieval performance. Several approaches to query expansion have been proposed, but among them pseudo-relevance feedback (PRF), also called blind relevance feedback, is observed to be one of the most effective and useful forms of automatic query expansion [27]. In this approach, the original query is first issued to a standard information retrieval system [29-31], and related terms are then extracted to expand the seed query: the top-K documents returned in the first retrieval pass may contain important terms that help separate relevant documents from irrelevant ones. In general, the expansion terms are chosen either as the most frequent terms in the feedback documents or as the terms most specific to the feedback documents relative to the entire collection. Fig. 1 describes the typical workflow of a PRF-based information retrieval system. Several QE methods have been devised by researchers; in this work, three QE models, namely Bo1, Bo2 and KL, are adopted for the analysis. Furthermore, a comparative study of the retrieval effectiveness of state-of-the-art retrieval models on Urdu is presented, and the effectiveness of applying QE to improve retrieval accuracy is explored. Terrier is used as the information retrieval framework for all experiments undertaken in this work [32].
Fig. 1. Architecture of Pseudo Relevance Feedback based System.
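The feedback loop of Fig. 1 can be sketched in a few lines. This is a minimal illustration of PRF with simple frequency-based term selection, not the DFR weighting Terrier actually applies (covered in Section V), and the toy documents are invented for the example:

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, k=10, n_expansion=5):
    """One round of PRF: take the top-k documents returned for the
    original query, score candidate terms by their frequency in that
    feedback set, and append the best scorers to the query."""
    feedback = ranked_docs[:k]
    counts = Counter()
    for doc in feedback:
        counts.update(doc)
    # drop terms already in the query, keep the n most frequent
    candidates = [(t, c) for t, c in counts.most_common() if t not in query_terms]
    expansion = [t for t, _ in candidates[:n_expansion]]
    return list(query_terms) + expansion

# Toy first-pass ranking: each document is a list of tokens.
docs = [["cricket", "ranking", "test"],
        ["cricket", "batsman", "ranking"],
        ["weather", "report"]]
print(pseudo_relevance_feedback(["cricket"], docs, k=2, n_expansion=2))
# ['cricket', 'ranking', 'test']
```

The expanded query is then re-issued against the index, as the second pass in the figure.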
IV. DOCUMENT COLLECTION
A. Documents, Topics, and Relevance Assessments
The efficiency of the different query expansion models described in this paper is measured on our own newly developed Urdu dataset. The collection consists of 85,304 documents based on the TREC specification, all encoded in UTF-8. A sample document in the standard TREC format is shown in Fig. 2.
<DOC>
<DOCNO>26_July_2012_Sportz4</DOCNO>
<TITLE>[Urdu headline]</TITLE>
<TEXT>
[Urdu article body]
</TEXT>
</DOC>
Fig. 2. A sample file in TREC format.
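Records in this format can be read with a small regular-expression parser. The sketch below assumes well-formed <DOC> blocks containing exactly the DOCNO, TITLE and TEXT fields shown in Fig. 2; a production reader would need to be more defensive:

```python
import re

def parse_trec_docs(raw: str):
    """Minimal parser for TREC-style <DOC> records as in Fig. 2."""
    docs = []
    for block in re.findall(r"<DOC>(.*?)</DOC>", raw, re.S):
        def field(tag):
            return re.search(rf"<{tag}>(.*?)</{tag}>", block, re.S).group(1).strip()
        docs.append({
            "docno": field("DOCNO"),
            "title": field("TITLE"),
            "text": field("TEXT"),
        })
    return docs

sample = "<DOC><DOCNO>26_July_2012_Sportz4</DOCNO><TITLE>t</TITLE><TEXT>body</TEXT></DOC>"
print(parse_trec_docs(sample)[0]["docno"])   # 26_July_2012_Sportz4
```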
A topic follows the standards common to other retrieval initiatives such as TREC. An ordinary topic has three sections: the title, the description and the narrative. A unique identification number is assigned to each topic to distinguish it from other, similar topics. One such example is given in Fig. 3. A set of 52 topics with their relevance assessments has been used for the analysis of query expansion.
<topic>
<topid>10</topid>
<title>[Urdu title]</title>
<desc>[Urdu description]</desc>
<narr>[Urdu narrative]</narr>
</topic>
Fig. 3. A sample topic (Topic 2, "Aadarsh Housing Scams").
V. QUERY EXPANSION MODELS
In the present analysis, DFR-based term weighting models, namely Bo1 [33], Bo2 [33] and KL, are employed using the Terrier search engine. Terrier implements a Divergence From Randomness (DFR) based QE mechanism that generalizes Rocchio's method [26]. First, the DFR model measures the weight of each term in the top-ranked documents. The most informative terms are then collected from the returned results and added to the original query to generate the expanded query. The weighting models are described in Sections V-A, V-B and V-C [34].
A. Kullback-Leibler (KL) Model
The Kullback-Leibler divergence computes the divergence between the probability distributions of terms in the whole collection and in the top-ranked documents returned by the first-pass retrieval for the original user query [35]. For a term t, this divergence is given by

\[ w(t) = P_n(t) \cdot \log_2\frac{P_n(t)}{P_c(t)} \tag{1} \]

\[ P_n(t) = \frac{\sum_{d \in n} tf_{t,d}}{\sum_{d \in n}\sum_{t' \in d} tf_{t',d}} \tag{2} \]

\[ P_c(t) = \frac{\sum_{d \in C} tf_{t,d}}{\sum_{d \in C}\sum_{t' \in d} tf_{t',d}} \tag{3} \]

where \(P_n(t)\) is the probability of the term t in the top-ranked documents n, and \(P_c(t)\) is the probability of the term t in the whole collection C.
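Under these definitions, the per-term KL weight reduces to a few arithmetic operations. The counts below are invented toy numbers:

```python
import math

def kl_weight(tf_top: int, tokens_top: int, tf_coll: int, tokens_coll: int) -> float:
    """KL expansion weight of one term (Eqs. 1-3).
    p_n: probability of the term in the top-ranked (feedback) documents;
    p_c: probability of the term in the whole collection.
    The weight is positive when the term is more concentrated in the
    feedback documents than in the collection as a whole."""
    p_n = tf_top / tokens_top
    p_c = tf_coll / tokens_coll
    return p_n * math.log2(p_n / p_c)

# Toy counts: the term occurs 12 times in 1,000 feedback tokens but only
# 50 times in a 100,000-token collection, so it is a good expansion term.
print(round(kl_weight(12, 1000, 50, 100000), 4))   # 0.055
```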
B. Bose-Einstein1 (Bo1) Model
This model is based on Bose-Einstein statistics; the weight of a term t in the top-ranked documents (the number of feedback documents typically ranging from 3 to 10) is given by [36]

\[ w(t) = \sum_{d \in n} tf_{t,d} \cdot \log_2\frac{1+P_n}{P_n} + \log_2\left(1+P_n\right) \tag{4} \]

\[ P_n = \frac{F}{N} \tag{5} \]

where F is the frequency of the term t in the whole collection and N is the number of documents in the collection, so that Eq. (5) denotes the average term frequency of t in the collection.
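Equation (4) can be computed directly once the collection statistics are known. The numbers below are invented for illustration (only N matches the size of the present collection):

```python
import math

def bo1_weight(tf_x: int, F: int, N: int) -> float:
    """Bo1 expansion weight (Eqs. 4-5).
    tf_x: frequency of the term in the top-ranked documents;
    F:    frequency of the term in the whole collection;
    N:    number of documents in the collection, so P_n = F / N is the
          average frequency of the term per document."""
    p_n = F / N
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)

# A term seen 8 times in the feedback documents but only 200 times in an
# 85,304-document collection receives a high positive weight:
w = bo1_weight(tf_x=8, F=200, N=85304)
print(w > 0)   # True
```

Rare terms that are frequent in the feedback set dominate, which is the intended behaviour of the model.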
C. Bose-Einstein2 (Bo2) Model
The scoring formula of Bo2 is given by

\[ w(t) = tf_x \cdot \log_2\frac{1+P_f}{P_f} + \log_2\left(1+P_f\right) \tag{6} \]

where:
- \(tf_x\) is the frequency of the term in the top-returned documents;
- \(P_f = F \cdot l_x / token_c\) is the expected frequency of the term t in the top-ranked documents, where F is the frequency of the query term in the whole collection, \(l_x\) is the total length in tokens of the exp_doc top-ranked documents (exp_doc being a parameter of the query expansion methodology), and \(token_c\) is the total number of tokens in the collection.
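A direct transcription of Eq. (6), with the quantities named as in the definitions above; the counts are toy values chosen so that the expected frequency works out to exactly 1:

```python
import math

def bo2_weight(tf_x: int, F: int, l_x: int, token_c: int) -> float:
    """Bo2 expansion weight (Eq. 6).
    tf_x:    frequency of the term in the top-returned documents;
    F:       frequency of the term in the whole collection;
    l_x:     total length in tokens of the top-ranked documents;
    token_c: total number of tokens in the collection.
    P_f = F * l_x / token_c is the frequency the term would be expected
    to have in the l_x feedback tokens under a random distribution."""
    p_f = F * l_x / token_c
    return tf_x * math.log2((1 + p_f) / p_f) + math.log2(1 + p_f)

# With F=200, l_x=5,000 and token_c=1,000,000, P_f = 1, so the weight
# is tf_x * log2(2) + log2(2) = tf_x + 1:
print(bo2_weight(8, 200, 5000, 1000000))   # 9.0
```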
VI. EXPERIMENT ARCHITECTURE
In this section, the performance on the Urdu collection under different query expansion methods is measured. First, retrieval with the Okapi BM25 model [22] as the weighting scheme was carried out using only the title field of the original queries. This run serves as the benchmark against which the results obtained by expanding each topic with related concepts are compared, as shown in Table II. The expanded terms assist in matching relevant documents to the associated query and help reduce the vocabulary mismatch between documents and queries [37]. Several experiments were performed using the set of 52 queries with the Okapi BM25 weighting scheme. Here, Rocchio beta = 0.4 and the retrieval parameter b = 0.4 gave the best results for query expansion. The evaluation was performed with the Terrier information retrieval framework, which has been found quite effective for indexing, retrieval and evaluation of English and non-English documents. To evaluate the results of the retrieval process, trec_eval, the evaluation program of the TREC conference, was used. trec_eval reports measures such as the total numbers of retrieved, relevant and relevant-retrieved (rel_ret) documents over all queries, as well as MAP, R-precision and interpolated recall-precision averages. For evaluating retrieval performance, the mean average precision (MAP) measure is used in this experiment; its value is computed over (at most) 100 retrieved documents per query.
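The MAP figure reported by trec_eval can be reproduced by hand for a toy run. The rankings and relevance sets below are invented, and the cutoff mirrors the 100-document limit used in this experiment:

```python
def average_precision(ranked, relevant, cutoff=100):
    """AP for one query: mean of precision@r at each rank r holding a
    relevant document, over the first `cutoff` results, normalised by
    the number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over all queries; `runs` maps query-id -> (ranking, relevant-set)."""
    aps = [average_precision(r, rel) for r, rel in runs.values()]
    return sum(aps) / len(aps)

runs = {
    "q1": (["d1", "d3", "d2"], {"d1", "d2"}),   # AP = (1/1 + 2/3) / 2
    "q2": (["d5", "d4"], {"d4"}),               # AP = 1/2
}
print(round(mean_average_precision(runs), 4))   # 0.6667
```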
TABLE II. MAP, R-PRECISION AND P@K OF BM25

Okapi BM25
Mean Average Precision   0.3162
R-precision              0.2990
P@10                     0.3308
P@20                     0.2702
P@100                    0.1360
VII. EXPERIMENTS FOR QUERY EXPANSION WITH BO1, BO2 & KL MODELS
In these experiments, Rocchio's approach is adopted to enhance the Okapi BM25 retrieval method. Initially, the top 5, 10 or 15 retrieved documents were chosen, and from each such feedback set 5, 10, 15, 30 or 50 terms were extracted. These terms were then added to the original queries to examine whether the results differ significantly. The top 100 documents are first retrieved using the baseline retrieval model; the model is then augmented with the Bo1, Bo2 and KL expansion models and analyzed with the MAP measure to assess further improvement in the results. Table III shows the results obtained after query expansion.
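The grid just described (3 document cuts x 5 term counts x 3 models) can be driven by a small sweep harness. Here evaluate_run is a hypothetical stand-in for the actual index-expand-retrieve-score pipeline:

```python
from itertools import product

def sweep(evaluate_run):
    """Run every (model, feedback-documents, expansion-terms) setting
    from Section VII and collect one score per configuration."""
    results = {}
    for model, n_docs, n_terms in product(
            ["Bo1", "Bo2", "KL"], [5, 10, 15], [5, 10, 15, 30, 50]):
        results[(model, n_docs, n_terms)] = evaluate_run(model, n_docs, n_terms)
    return results

# Example with a dummy evaluator, just to inspect the sweep shape:
grid = sweep(lambda m, d, t: 0.0)
print(len(grid))   # 3 models x 3 document cuts x 5 term counts = 45 runs
```

Each cell of Table III corresponds to one entry of this grid.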
VIII. RESULTS AND DISCUSSIONS
In this section, the baseline text retrieval is compared with the expansion variants discussed in Section V. Table III presents the results obtained on the Urdu dataset. The highest MAP is achieved by BM25 (b = 1.0) enhanced by the Bo1 model, with an overall improvement of +25.55% over the baseline retrieval method when the numbers of feedback documents and selected terms are 10 and 15, respectively. Likewise, the highest MAP for BM25 (b = 1.0) enhanced with the KL expansion model shows a +25.68% improvement over the baseline when the number of documents is 10 and the number of selected terms is 5. Moreover, in all the cases examined, the Bo1 and KL models performed almost identically, whereas the results obtained with Bo2 are not appreciable, because fewer relevant documents are retrieved owing to term mismatch between the original query terms and the candidate expansion terms.
IX. CONCLUSION AND FUTURE WORK
The focus of the present work is to observe the effect of query expansion on the given Urdu dataset, which has not been addressed before at such a scale. The results show that the KL model performed extremely well in comparison with the other expansion models, Bo1 and Bo2, on the present Urdu data collection. In addition, the Kullback-Leibler model was found to significantly enhance the MAP of the retrieved results, by almost 22-24% in the above study. In future, an external resource such as a WordNet-based approach to the Bo2 model will be investigated to enhance the efficiency of Urdu information retrieval.
ACKNOWLEDGMENT
We sincerely thank Mr. Zahoor Ahmad Shora, chief editor of `Daily Roshni', for his generous contribution in freely sharing the raw data for the collection, and we also thank Mr. Hamaid Mehmood for his guidance and kind support.
REFERENCES
[1] A. Singhal, “Modern information retrieval: A brief overview,” IEEE
Data Eng. Bull., vol. 24, no. 4, pp. 35–43, 2001.
[2] P. Majumdar, M. Mitra, S. K. Parui, and P. Bhattacharya, “Initiative for
indian language ir evaluation,” 2007.
[3] E. Darrudi, M. R. Hejazi, and F. Oroumchian, “Assessment of a modern
farsi corpus,” in Proceedings of the 2nd Workshop on Information
Technology & its Disciplines (WITID), 2004.
[4] A. Daud, W. Khan, and D. Che, “Urdu language processing: a survey,”
Artificial Intelligence Review, pp. 1–33, 2016.
[5] D. Becker and K. Riaz, “A study in urdu corpus construction,” in
Proceedings of the 3rd workshop on Asian language resources and
international standardization-Volume 12. Association for Computational
Linguistics, 2002, pp. 1–5.
[6] A. Hardie, “Developing a tagset for automated part-of-speech tagging in
urdu.” in Corpus Linguistics 2003, 2003.
[7] S. Hussain, “Resources for urdu language processing.” in IJCNLP, 2008,
pp. 99–100.
[8] S. Urooj, S. Hussain, F. Adeeba, F. Jabeen, and R. Parveen, “Cle urdu
digest corpus,” LANGUAGE & TECHNOLOGY, vol. 47, 2012.
[9] W. Khana, A. Daudb, J. A. Nasira, and T. Amjada, “Named entity
dataset for urdu named entity recognition task,” Organization, vol. 48, p.
282.
[10] S. Alnofaie, M. Dahab, and M. Kamal, “A novel information retrieval
approach using query expansion and spectral-based,” information
retrieval, vol. 7, no. 9, 2016.
[11] M. Y. Dahab, M. Kamel, and S. Alnofaie, “Further investigations for
documents information retrieval based on dwt,” in International
Conference on Advanced Intelligent Systems and Informatics. Springer,
2016, pp. 3–11.
[12] D Pal, M Mitra, and K Datta, “Query expansion using term distribution
and term association,” arXiv preprint arXiv:1303.0667, 2013.
[13] D Pal, M Mitra, K Datta, “Improving query expansion using wordnet,”
Journal of the Association for Information Science and Technology, vol.
65, no. 12, pp. 2469–2478, 2014.
[14] K. Riaz, “Baseline for urdu ir evaluation,” in Proceedings of the 2nd ACM
Workshop on Improving Non English Web Searching. ACM, 2008, pp. 97–
100.
[15] K Riaz, “Urdu is not hindi for information access,” in Workshop on
Multilingual Information Access, SIGIR, 2009.
[16] K Riaz, “Comparison of hindi and urdu in computational context,” Int J
Comput Linguist Nat Lang Process, vol. 1, no. 3, pp. 92–97, 2012.
[17] M. I. Razzak, “Online urdu character recognition in unconstrained
environment,” Ph.D. dissertation, International Islamic University,
Islamabad, 2011.
[18] V. Gupta, N. Joshi, and I. Mathur, “Design & development of rule based
inflectional and derivational urdu stemmer usal,” in Futuristic Trends on
Computational Analysis and Knowledge Management (ABLAZE), 2015
International Conference on. IEEE, 2015, pp. 7–12.
[19] S. Iqbal, M. W. Anwar, U. I. Bajwa, and Z. Rehman, “Urdu spell
checking: Reverse edit distance approach,” in Proceedings of the 4th
Workshop on South and Southeast Asian Natural Language Processing,
2013, pp. 58–65.
[20] S. Stymne, “Spell checking techniques for replacement of unknown
words and data cleaning for haitian creole sms translation,” in
Proceedings of the Sixth Workshop on Statistical Machine Translation.
Association for Computational Linguistics, 2011, pp. 470–477.
[21] A. Spink, D. Wolfram, M. B. Jansen, and T. Saracevic, “Searching the
web: The public and their queries,” Journal of the Association for
Information Science and Technology, vol. 52, no. 3, pp. 226–234, 2001.
[22] S. E. Robertson, “The probability ranking principle in ir,” Journal of
documentation, vol. 33, no. 4, pp. 294–304, 1977.
[23] F. Diaz, “Pseudo-query reformulation,” in European Conference on
Information Retrieval. Springer, 2016, pp. 521–532.
[24] G. Salton and C. Buckley, “Improving retrieval performance by
relevance feedback,” Readings in information retrieval, vol. 24, no. 5,
pp. 355–363, 1997.
[25] D. Metzler and W. B. Croft, “Latent concept expansion using markov
random fields,” in Proceedings of the 30th annual international ACM
SIGIR conference on Research and development in information
retrieval. ACM, 2007, pp. 311–318.
[26] J. J. Rocchio, “Relevance feedback in information retrieval,” The Smart
retrieval system-experiments in automatic document processing, 1971.
[27] J. Xu and W. B. Croft, “Query expansion using local and global
document analysis,” in ACM SIGIR Forum, vol. 51, no. 2. ACM, 2017,
pp. 168–175.
[28] C. Zhai and J. Lafferty, “Model-based feedback in the language
modeling approach to information retrieval,” in Proceedings of the tenth
international conference on Information and knowledge management.
ACM, 2001, pp. 403–410.
[29] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan,
“Searching the web,” ACM Transactions on Internet Technology
(TOIT), vol. 1, no. 1, pp. 2–43, 2001.
[30] R. Baeza-Yates and B. Ribeiro-Neto, “Modern information retrieval
addison-wesley longman,” Reading MA, 1999.
[31] I. H. Witten, A. Moffat, and T. C. Bell, Managing gigabytes:
compressing and indexing documents and images. Morgan Kaufmann,
1999.
[32] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma,
“Terrier: A high performance and scalable information retrieval
platform,” in Proceedings of the OSIR Workshop, 2006, pp. 18–25.
[33] G. Amati, “Probability models for information retrieval based on
divergence from randomness,” Ph.D. dissertation, University of
Glasgow, 2003.
[34] V. Plachouras, B. He, and I. Ounis, “University of glasgow at trec 2004:
Experiments in web, robust, and terabyte tracks with terrier.” in TREC,
2004.
[35] T. Cover and J. Thomas, “Elements of information theory wiley new
york,” NY Google Scholar, 1991.
[36] C. Macdonald, B. He, V. Plachouras, and I. Ounis, “University of
glasgow at trec 2005: Experiments in terabyte and enterprise tracks with
terrier.” in TREC, 2005.
[37] M. Shokouhi and J. Zobel, “Robust result merging using sample-based
score estimates,” ACM Transactions on Information Systems (TOIS),
vol. 27, no. 3, p. 14, 2009.
TABLE III. MAP OF DIFFERENT EXPANSION MODELS BASED ON ROCCHIO PSEUDO-RELEVANCE FEEDBACK

Okapi BM25 (a probabilistic retrieval model); MAP without PRF: 0.3162

Docs  Terms  BM25_Bo1           BM25_Bo2           BM25_KL
5     5      0.3867 (+22.30%)   0.2103 (-33.49%)   0.3884 (+22.83%)
5     10     0.3866 (+22.26%)   0.2090 (-33.90%)   0.3899 (+23.31%)
5     15     0.3899 (+23.31%)   0.2180 (-31.06%)   0.3902 (+23.40%)
5     30     0.3905 (+23.50%)   0.2453 (-22.42%)   0.3936 (+24.48%)
5     50     0.3893 (+23.12%)   0.2694 (-14.80%)   0.3937 (+24.51%)
10    5      0.3935 (+24.45%)   0.1772 (-43.96%)   0.3974 (+25.68%)
10    10     0.3941 (+24.64%)   0.1913 (-39.50%)   0.3951 (+24.95%)
10    15     0.3970 (+25.55%)   0.1939 (-38.68%)   0.3965 (+25.40%)
10    30     0.3931 (+24.32%)   0.2339 (-26.03%)   0.3967 (+25.46%)
10    50     0.3927 (+24.19%)   0.2642 (-16.45%)   0.3965 (+25.40%)
15    5      0.3958 (+25.17%)   0.1787 (-43.49%)   0.3915 (+23.81%)
15    10     0.3924 (+24.10%)   0.1826 (-42.25%)   0.3903 (+23.43%)
15    15     0.3966 (+25.43%)   0.1887 (-40.32%)   0.3965 (+25.40%)
15    30     0.3966 (+25.43%)   0.2263 (-28.43%)   0.3966 (+25.43%)
15    50     0.3914 (+23.78%)   0.2623 (-17.05%)   0.3923 (+24.07%)
... Extreme Programming (XP) adalah suatu proses pengembangan perangkat lunak yang menggunakan konsep berorientasi objek dan target dari metode ini ialah tim dengan skala kecil sampai dengan menengah (Supriyatna & Informatika, 2018). Query Expansion merupakan suatu metode yang digunakan untuk mengimprovisasi hasil dari pencarian dengan memperluas query dari pengguna untuk mendapatkan hasil penelusuran yang lebih baik (Rasheed & Banka, 2018). ...
Article
Full-text available
Technology is often used by the public to share and find information on the internet through social media, websites, and others. The internet allows people to access information regardless of time and place, wherever and whenever, but one of the negative impacts that often occurs with the presence of the internet is the spread of false information (hoax). Sometimes it is very difficult to distinguish whether the information is true or false (hoax). The impact of the spread of false information is unrest and division in society. This study aims to design and create a website based system that can be used to check whether the information spread is true or false (hoax) by using Single Page Application (SPA) technology with extreme programming as its development method. This research produces a system that can make it easier for people to distinguish between false and true information.
... Tokenization is the process of splitting strings in a given document into words known as tokens by using a tokenizer to read delimiters such as /-"[]():?<>! Characters [22]. ...
Article
Full-text available
Urdu is a widely spoken language in the Indian subcontinent with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared to those in other European and Asian languages. Therefore, by following Text Retrieval Conference standards, we attempted to construct an extensive text collection of 85 304 documents from diverse categories covering over 52 topics with relevance judgment sets at 100 pool depth. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other linguistic applications with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
Chapter
In modern era, due to several variations of user requirements, number of company and start-up increases rapidly. Each company has its own strategy and rules for maintaining company profit and loss. Market condition is one parameter for this situation. Sometime, different crisis or pandemic situation are raised in the society which become crucial for handling and managing. So, company manage their productivity and sales in chronological order that maintain the equilibrium based on customer requirements and market conditions. This chapter is based on conflicting strategy management technique for company using quadratic programming. In this chapter, quadratic programming plays the role of mathematical optimization based on desire objective function along with constraints. In this model, fuzzy logic is used to makes the quadratic programming flexible which is used to maintain variations of the customer requirements and demands efficiently. The proposed method simulated and validated in LINGO optimization software in terms of conflicting strategies of the company.
Chapter
In modern era, technology increases rapidly due to numerous requirements of the user or customer. There are various products and applications produced by the company with the context of requirement. One product is manufactured by several companies with some variants. So, several companies are competitor one to another. In this paper, an optimal solution is designed to minimize the losses of the company in uncertain environment. Here, uncertain environment indicates the environment that consists of several imprecise information. This information is created based on conflicting requirement of the users. So, in this paper, loss of company is minimized by reducing uncertainty. Quadratic programming is used to model the main objective and its related constraints in the form of nonlinear. In this model, decision variables are in the form of square. Fuzzy logic is used to reduce the imprecise information efficiently. The combination of both quadratic programming and fuzzy logic helps to model the main goal of the paper. Finally, the proposed method is formulated into LINGO optimization software to validate the main problem efficiently and effectively.
Article
Full-text available
Retrieving relevant documents from a large collection using only the original query is a formidable challenge. A generic way to improve retrieval is pseudo-relevance feedback, which expands the original query with conducive keywords so that the most relevant documents for the original query are returned. In this paper, five different hybrid techniques were tested using traditional query expansion methods. A boosting query term method was then proposed to reweigh and strengthen the original query. Query-wise analysis revealed that the proposed approach effectively identified the most relevant keywords, even for short queries. The potency of all the proposed methods was evaluated on three datasets: Roshni, Hamshahri1, and FIRE2011. Compared to traditional query expansion methods, the proposed methods improved the mean average precision on the Urdu, Persian, and English datasets by 14.02%, 9.93%, and 6.60%, respectively. The results were further established using analysis of variance and post-hoc analysis.
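The pseudo-relevance feedback loop described above can be sketched in a few lines: retrieve top documents for the original query, then add terms that are frequent in those documents but rare in the collection. The documents, ranking function, and weighting below are toy stand-ins, not the paper's boosting method.

```python
import math
from collections import Counter

# Toy pseudo-relevance feedback: expand a query with terms frequent in the
# top-ranked documents but rare in the collection. All data is hypothetical.
docs = [
    "urdu news corpus retrieval experiments",
    "query expansion improves urdu retrieval",
    "cricket match score report",
    "weather forecast rain tomorrow",
]

def rank(query, docs):
    """Rank documents by simple term overlap with the query."""
    q = set(query.split())
    return sorted(docs, key=lambda d: -len(q & set(d.split())))

def expand(query, docs, top_docs=2, add_terms=2):
    feedback = rank(query, docs)[:top_docs]
    tf = Counter(t for d in feedback for t in d.split())       # feedback freq
    df = Counter(t for d in docs for t in set(d.split()))      # document freq
    candidates = [t for t in tf if t not in query.split()]
    # Weight: feedback frequency discounted by collection frequency (idf-like).
    candidates.sort(key=lambda t: -tf[t] * math.log(len(docs) / df[t] + 1))
    return query.split() + candidates[:add_terms]

expanded = expand("urdu retrieval", docs)
```

The expanded query keeps the original terms first, which is also the natural hook for the reweighting ("boosting") of original terms that the paper proposes.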
Chapter
Full-text available
Feature selection plays a crucial role in text classification by minimizing the dimensionality of the features and accelerating the learning process of the classifier. Text classification is the process of dividing texts into categories based on their content and subject. Text classification techniques have been applied in domains such as medicine, politics, news, and law, showing that the adoption of domain-relevant features can improve classification performance. Despite plenty of research on classification in several languages across the world, there is a lack of such work in Urdu due to the shortage of existing resources. In this paper, we first present a hybrid feature selection approach (HFSA) for text classification of Urdu news articles. Second, we incorporate widely used filter selection approaches along with Latent Semantic Indexing (LSI) to extract the essential features of Urdu documents. The hybrid approach was tested with a Support Vector Machine (SVM) classifier on the Urdu "ROSHNI" dataset. The results were compared with those obtained by the individual filter feature selection methods and with a baseline feature selection method. The results show that the proposed approach achieves better classification, with promising accuracy and better efficiency.
Article
Full-text available
Named entity recognition (NER) and classification is a crucial task in Urdu. One challenge, among others, that makes Urdu NER complex is the non-availability of sufficient linguistic resources. NER research for English and other Western languages has a long tradition, and a significant amount of work has been done to solve NER problems in these languages; from the resource-availability perspective, Western languages count as resource-rich. Urdu, on the other hand, lags far behind in terms of resources. In this paper we report the development of an NE-tagged dataset for automated NER research in Urdu, especially from a machine learning (ML) perspective. The newly developed Urdu NER dataset contains about 48,000 words, comprising 4,621 named entities across seven named entity classes. The content source of the dataset is BBC Urdu, initially covering the sports, national, and international news domains. The dataset can be used for training and testing various statistical and machine learning models, such as the hidden Markov model (HMM), maximum entropy (ME), conditional random field (CRF), and recurrent neural network (RNN), for computational NER research in Urdu. Our goal is to make this new dataset freely and widely available, and to encourage other researchers to use it as a standard testbed for experimentation in Urdu NER research. In the rest of the paper, the new NER dataset is referred to as the UNER dataset.
Chapter
Full-text available
In most classical information retrieval models, documents are represented as bags of words that take into account term frequencies (tf) and inverse document frequencies (idf) while ignoring term proximity. Recently, proximity among query terms has been observed to improve document retrieval performance. Several retrieval applications determine term proximity at the query formulation level and rank documents based on the relative positions of the query terms within them. Such systems must store all proximity data in the index, leading to a large index that slows the search. Many recent models instead use a term signal representation, in which the query is transformed from the time domain to the frequency domain using transformation techniques such as the wavelet transform. The Discrete Wavelet Transform (DWT) uses a multi-resolution technique by which different frequencies are analyzed at different resolutions. The advantage of the DWT is that it considers the spatial information of the query terms within the document rather than only term counts. In this paper, in order to improve the ranking score, improve run-time efficiency in resolving the query, and maintain a reasonable index size, three types of spectral analysis based on semantic segmentation are carried out: sentence-based segmentation, paragraph-based segmentation, and fixed-length segmentation; different term weightings are also applied according to term position.
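The term-signal idea above is concrete enough to sketch: split a document into fixed-length segments, count the query term's occurrences per segment, and apply the Haar DWT so the coefficients capture both how often and where the term occurs. The segment counts below are invented toy data; the transform itself is the standard Haar decomposition.

```python
# Haar DWT of a term signal: a document is split into 8 fixed-length segments
# and signal[i] is the number of occurrences of a query term in segment i.

def haar_dwt(signal):
    """Full Haar decomposition (signal length must be a power of two).

    Returns [overall average] followed by detail coefficients from the
    coarsest to the finest resolution level.
    """
    details, s = [], list(signal)
    while len(s) > 1:
        avgs = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        diffs = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = diffs + details        # finer levels go to the back
        s = avgs
    return s + details

term_signal = [2, 0, 1, 0, 0, 3, 0, 0]   # toy per-segment term counts
coeffs = haar_dwt(term_signal)
```

The first coefficient is the overall term frequency per segment (the tf-like component), while the detail coefficients encode where in the document the occurrences cluster, which is the positional information the spectral model exploits.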
Article
Full-text available
Most information retrieval (IR) models rank documents by computing a score using only the lexicographical query terms or the frequency of the query terms in the document. These models are limited in that they consider neither term proximity within the document nor term mismatch. Term proximity is an important factor in determining how related a document is to the query. The ranking functions of the Spectral-Based Information Retrieval Model (SBIRM) consider both the frequency and the proximity of query terms by comparing the signals of the query terms in the spectral domain rather than the spatial domain, using the Discrete Wavelet Transform (DWT). Query expansion (QE) approaches overcome the word-mismatch problem by adding terms with related meanings to the query; they divide into a statistical approach, Kullback-Leibler divergence (KLD), and a semantic approach, P-WNET, which uses WordNet, and both enhance performance. Based on the foregoing considerations, the objective of this research is to build an efficient QESBIRM that combines QE with the proximity-aware SBIRM by implementing SBIRM using the DWT together with KLD or P-WNET. Experiments were conducted to test and evaluate QESBIRM using a Text Retrieval Conference (TREC) dataset. The results show that SBIRM with the KLD or P-WNET model outperforms the plain SBIRM in precision (P@), R-precision, Geometric Mean Average Precision (GMAP), and Mean Average Precision (MAP).
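The KLD side of the expansion described above scores a candidate term by how much its probability in the feedback documents diverges from its probability in the whole collection. A minimal sketch with invented toy documents (not the paper's TREC data):

```python
import math
from collections import Counter

# KLD-style scoring of candidate expansion terms: terms whose distribution in
# the feedback documents diverges most from the collection get the highest
# weight. Documents below are hypothetical toy data.
feedback_docs = ["query expansion urdu retrieval", "urdu retrieval evaluation"]
collection = feedback_docs + ["cricket score update", "weather rain report"]

def kld_scores(feedback_docs, collection):
    fb = Counter(t for d in feedback_docs for t in d.split())
    coll = Counter(t for d in collection for t in d.split())
    n_fb, n_coll = sum(fb.values()), sum(coll.values())
    # score(t) = p(t|feedback) * log( p(t|feedback) / p(t|collection) )
    return {t: (fb[t] / n_fb) * math.log((fb[t] / n_fb) / (coll[t] / n_coll))
            for t in fb}

scores = kld_scores(feedback_docs, collection)
top = max(scores, key=scores.get)
```

Terms that appear only in the feedback documents ("urdu", "retrieval") dominate the ranking, while terms spread evenly across the collection score near zero, which is the behaviour the KLD criterion is designed to produce.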
Article
Full-text available
Extensive work has been done on the various activities of natural language processing for Western languages compared with their Eastern counterparts, particularly South Asian languages. Western languages are termed resource-rich: core linguistic resources, e.g., corpora, WordNets, dictionaries, gazetteers, and associated tools, are customarily available for them. Most South Asian languages, by contrast, are low-resource. Urdu, a South Asian language among the most widely spoken in the subcontinent, is one such case, and due to this resource scarcity not enough work has been conducted for it. The core objective of this paper is to survey the linguistic resources that exist for Urdu language processing (ULP), to highlight the different tasks involved, and to discuss the available state-of-the-art techniques. Conclusively, this paper attempts to describe in detail the recent increase in interest and the progress made in ULP research. First, the available datasets for Urdu are discussed. Characteristics of the language, resource sharing between Hindi and Urdu, and Urdu orthography and morphology are then presented. Pre-processing activities such as stop-word removal, diacritic removal, normalization, and stemming are illustrated, followed by a review of state-of-the-art research on tokenization, sentence boundary detection, part-of-speech tagging, named entity recognition, parsing, and WordNet development. In addition, the impact of ULP on application areas such as information retrieval, classification, and plagiarism detection is investigated. Finally, open issues and future directions for this new and dynamic area of research are provided. The goal of this paper is to organize ULP work so as to provide a platform for future ULP research activities.
Article
Automatic query expansion has long been suggested as a technique for dealing with the fundamental issue of word mismatch in information retrieval. A number of approaches to expansion have been studied and, more recently, attention has focused on techniques that analyze the corpus to discover word relationships (global techniques) and those that analyze the documents retrieved by the initial query (local feedback). In this paper, we compare the effectiveness of these approaches and show that, although global analysis has some advantages, local analysis is generally more effective. We also show that global analysis techniques can be applied to the locally retrieved documents to combine the benefits of both approaches.
Conference Paper
Automatic query reformulation refers to rewriting a user’s original query in order to improve the ranking of retrieval results compared to the original query. We present a general framework for automatic query reformulation based on discrete optimization. Our approach, referred to as pseudo-query reformulation, treats automatic query reformulation as a search problem over the graph of unweighted queries linked by minimal transformations (e.g. term additions, deletions). This framework allows us to test existing performance prediction methods as heuristics for the graph search process. We demonstrate the effectiveness of the approach on several publicly available datasets.
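The graph-search formulation above can be sketched as a greedy hill-climb: neighbours of a query are single-term additions or deletions, and a performance predictor serves as the search heuristic. The vocabulary and the predictor below are hypothetical stand-ins (a real system would use a query performance prediction method, not a hand-built scorer).

```python
# Toy pseudo-query reformulation as greedy search over the query graph.
# Neighbours = queries one term-addition or term-deletion away; a hypothetical
# predictor scores each neighbour and we climb while the score improves.

vocab = {"urdu", "retrieval", "expansion", "corpus"}

def predict_quality(query):
    """Stand-in performance predictor rewarding a hypothetical ideal query."""
    ideal = {"urdu", "retrieval", "expansion"}
    q = set(query)
    return len(q & ideal) - len(q - ideal)

def neighbours(query):
    q = set(query)
    adds = [q | {t} for t in vocab - q]                # minimal additions
    dels = [q - {t} for t in q if len(q) > 1]          # minimal deletions
    return [frozenset(n) for n in adds + dels]

def reformulate(query):
    current = frozenset(query)
    while True:
        best = max(neighbours(current), key=predict_quality)
        if predict_quality(best) <= predict_quality(current):
            return current                             # local optimum reached
        current = best

final = reformulate({"urdu", "corpus"})
```

Each step both adds helpful terms and prunes harmful ones, so the search converges to the predictor's optimum; with a real (noisy) performance predictor, the framework's interest lies in how well such heuristics guide the same walk.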