ARTICLE IN PRESS
JID: IPM [m3Gsc;November 9, 2015;14:6]
Information Processing and Management 000 (2015) 1–12
Contents lists available at ScienceDirect
Information Processing and Management
journal homepage: www.elsevier.com/locate/ipm
A query term re-weighting approach using document similarity
Payam Karisani a,∗, Maseud Rahgozar a, Farhad Oroumchian b
a Database Research Group, Control and Intelligent Processing Center of Excellence, School of Electrical and Computer Engineering, University of Tehran, Iran
b University of Wollongong, Dubai
article info
Article history:
Received 27 May 2015
Revised 19 September 2015
Accepted 23 September 2015
Available online xxx
Keywords:
Text retrieval
Query term re-weighting
Document similarity
Query expansion
abstract
Pseudo-relevance feedback is the basis of a category of automatic query modification techniques. Pseudo-relevance feedback methods assume the initial retrieved set of documents to be relevant. They then use these documents to extract more relevant terms for the query, or simply re-weigh the user's original query. In this paper, we propose a straightforward, yet effective, use of pseudo-relevance feedback for detecting more informative query terms and re-weighting them. A query-by-query analysis of our results indicates that our method is capable of identifying the most important keywords even in short queries. Our main idea is that some of the top documents may contain a context closer to the user's information need than the others. Therefore, re-examining the similarity of those top documents and weighting this set based on their context can help in identifying and re-weighting informative query terms. Our experimental results on standard English and Persian test collections show that our method improves retrieval performance, in terms of MAP, by up to 7% over traditional query term re-weighting methods.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Traditional computer-based IR concentrates on techniques that improve the performance of retrieval systems. Examples
of such techniques are probabilistic or language modeling (Craswell, Robertson, Zaragoza, & Taylor, 2005; Zaragoza, Craswell,
Taylor, Saria, & Robertson, 2004), personalized search (Croft, Cronen-Townsend, & Lavrenko, 2001; Sieg, Mobasher, & Burke,
2007), query classification (Kang & Kim, 2003), and query modification (Lavrenko & Croft, 2001; Lee, Croft, & Allan, 2008). Query
modification techniques are a group of models that try to improve the retrieval performance by improving the original user
query. There are two main classes of query modification methods. The first class is called query expansion in which the system
reformulates the user query (Lavrenko & Croft, 2001; Lee, Croft, & Allan, 2008) by adding extra terms and re-weighting the query
terms. The second class however, concentrates only on re-weighting the query terms (Bendersky & Croft, 2008; Robertson &
Jones, 1976).
In this paper, we propose an approach to query modification through query term re-weighting. We use automatic feedback
to retrieve the first set of relevant documents, and then we extract the information which is needed for assigning a meaningful
weight to each query term. Our experimental results in English and Persian languages indicate that our method outperforms
traditional query term re-weighting approaches.
The rest of this paper is organized as follows: Section 2 provides an overview of related studies. Section 3 presents our approach to query term re-weighting in detail. Section 4 reports our results: Section 4.1 explains our experimental setup,
∗Corresponding author. Tel.: +98 2182089718.
E-mail addresses: p.karisani@gmail.com (P. Karisani), rahgozar@ut.ac.ir (M. Rahgozar), oroumchian@acm.org (F. Oroumchian).
http://dx.doi.org/10.1016/j.ipm.2015.09.002
0306-4573/© 2015 Elsevier Ltd. All rights reserved.
Please cite this article as: P. Karisani et al., A query term re-weighting approach using document similarity, Information Pro-
cessing and Management (2015), http://dx.doi.org/10.1016/j.ipm.2015.09.002
Sections 4.2 and 4.3 present our results in English and Persian data sets, and Section 4.4 discusses the method. Finally, Section 5
concludes the paper.
2. Related work
A substantial amount of work has been done (Bendersky & Croft, 2008; Lavrenko & Croft, 2001; Lee, Croft, & Allan, 2008; Robertson & Jones, 1976) in English information retrieval. Several studies have influenced our work in one way or another. Lee, Croft, and Allan (2008) propose a method based on the local cluster hypothesis. The cluster hypothesis states that a group of similar documents tend to be relevant to the same query. Using a k-NN method, they cluster the top retrieved documents,
and rank the clusters based on the likelihood of generating the query. Then using the relevance model (Lavrenko & Croft, 2001)
they extract the new terms for expansion from the documents which belong to the top clusters. In their method, the documents
which appear in several clusters are called dominant. Their hypothesis is that these documents have a good representation of the
topics of the query. Because they appear multiple times in the clusters, they can contribute more to the expansion process and
improve the precision. Liu, Natarajan, and Chen (2011) use local clustering to propose a novel method for query suggestion. Based
on the number of clusters which exist in the top documents, their goal is to suggest a diversified set of expanded queries to the
user. Their assumption is that this set of queries will cover all the topics related to ambiguous user queries. The result of each query in the set, when run against the collection, should be the corresponding cluster with the highest precision and recall. They prove that this problem is NP-hard and propose two algorithms to predict the queries. While our method, like these methods, tries to extract the information that the top documents carry, there are some differences. First, we do not add new terms to the query; the extracted information is used to re-weigh the original query terms. Second, our approach to extracting the information is different. We do not cluster the top documents; instead, we treat each one as a single entity that carries information.
One of the first studies on query term re-weighting was carried out by Robertson and Jones (1976). Their approach is based on the probabilistic retrieval model. The main idea of the probabilistic model is that there is a set of documents that contains exactly all the relevant documents. Using the properties of this set, we could retrieve the relevant documents; because we do not have access to the set, we try to estimate its properties. Thus an initial guess is made about the weights of the query terms to retrieve the first set of documents. In the next step, using an incidence contingency table over the top documents, the weights of the query terms are refined to retrieve the final set. Here we do not use the probabilistic framework, and we also exploit the information that the top documents carry in relation to each other; there is no such step in Robertson's model.
Bendersky and Croft (2008) propose a framework to discover key concepts in verbose queries. First, they propose a model
based on language modeling approaches to incorporate concept weights into the retrieval process. Then they define a function
which estimates the membership of terms in the set of related concepts to the query. The normalized version of this function
is used in their retrieval process. To evaluate the value of this function they use a machine learning approach. In their method
concepts are mapped to a feature vector. The values of the vector are several query-dependent and query-independent features.
One of their most effective features is the Weighted Information Gain (Zhou & Croft, 2007), which we discuss in Section 4. Here we also focus on short queries. Moreover, we directly map terms to the corresponding weights, because we use only one resource: the top documents.
Recently, many studies have been conducted in Persian text retrieval. Saboori, Bashiri, and Oroumchian (2012) investigated the role of query term re-weighting using the vector space model (Salton, Wong, & Yang, 1975). Hakimian and Taghiyareh (2008) tried optimizing the parameters of Local Context Analysis (Xu & Croft, 2000). The role of the N-gram based vector space model and the Local Context Analysis approach was studied in Aleahmad, Hakimian, Mahdikhani, and Oroumchian (2007).
In this research, we demonstrate that query term re-weighting can be useful even in short queries, i.e., those with about three terms. Furthermore, we propose a straightforward, yet effective, method for estimating the importance of query terms. An immediate impact of our work would be achieving higher performance in document retrieval by emphasizing those terms in more elaborate weighting schemes.
Our main motivation for this research was the amount of work that has been carried out in this area on verbose queries. Much research has concentrated on long queries, since it is intuitive to assume that identifying and eliminating less influential terms in long queries could boost performance. However, few studies specifically investigate the role of keyword detection in short queries. Therefore, we felt that such an effort is needed to understand the contribution of terms in all kinds of queries. Apart from this aim, other requirements of our work are simplicity and robustness, in order to make our method suitable for real-world scenarios. We achieve simplicity by only using attributes that are readily available at run time. The robustness of our method comes from the fact that we do not rely on a single source of evidence to assign our weights; instead, we use several filters and steps to ensure the effectiveness of the process.
3. Proposed term re-weighting method
In this section, we present our term re-weighting method. First, we use the original user’s query to retrieve the initial relevant
documents; then we assign a weight to each relevant document which defines the importance of that document to the user’s
information need. Finally, we modify the weight of each query term based on their occurrence in these weighted documents. Our
method can be categorized as one of the local feedback query modification methods. Local feedback query modification methods
use the context of the documents retrieved for a given query in the first phase to reformulate the query. Sections 3.1 and 3.2 introduce the calculation of the weight of each term in the re-weighting scheme.
3.1. Local feedback using document similarity
In local feedback methods, the main source for extracting information about the user's information need is the set of documents retrieved in the first phase. For instance, Bendersky and Croft (2008) and Liu, Natarajan, and Chen (2011) use the initially retrieved documents to detect more effective words to add to the user's original query. In our approach, we use these documents to weigh the user's original query terms, because we believe this is a more effective way of using the original query terms. Assuming the top retrieved documents to be relevant is a common assumption in pseudo-relevance feedback methods. Although this assumption carries the danger of query drift (Manning, Raghavan, & Schütze, 2008), it is reported that using the top retrieved documents, in a controlled way, improves retrieval effectiveness significantly (Lavrenko & Croft, 2001; Lee, Croft, & Allan, 2008; Robertson & Jones, 1976).
We can represent the original query by the vector Q as Q = \{q_1, q_2, \ldots\}, where q_i denotes the ith query term. We also define the final weight of each query term q_i as follows:

$$W_{q_i} = \sum_{j=1}^{N} W_{q_i}^{d_j} \quad (1)$$

In Eq. (1), W_{q_i} denotes the final weight of q_i, N is the number of selected top documents from the initial retrieved documents, and W_{q_i}^{d_j} denotes the weight contributed by document d_j, the jth retrieved document, for the query term q_i.
Our hypothesis is that W_{q_i}^{d_j} is valuable only if d_j is truly relevant to the user's information need; that is, although our retrieval engine returned d_j as a relevant document, there is a chance that d_j might not be as relevant as it should be. To address the importance of the documents in the retrieved set, we define W_{q_i}^{d_j}, the adjusted weight of the term q_i in the document d_j, as below:

$$W_{q_i}^{d_j} = w_{q_i}^{d_j} \times v_{d_j} \quad (2)$$

In Eq. (2), w_{q_i}^{d_j} denotes the weight of the term q_i in the document d_j, and v_{d_j} denotes the relevance of d_j to the user's information need.
The standard TF-IDF weighting model can be used to calculate the base weight of the query term q_i in the document d_j (w_{q_i}^{d_j} in Eq. (2)):

$$w_{q_i}^{d_j} = F_{q_i}^{d_j} \times IDF_{q_i} \quad (3)$$

In Eq. (3), F_{q_i}^{d_j} denotes the frequency of q_i in d_j, and IDF_{q_i} denotes the inverse document frequency of q_i in the whole collection. To calculate the importance of a document to the original query, v_{d_j} in Eq. (2), we assume that the initial retrieved documents are a good prediction of the user's information need. Thus we measure the relevance of each document to the user's information need by evaluating its distance to the other documents in the retrieved set. The size of the retrieved set is determined experimentally. We use the following equation to compute the similarity of each top document to the other documents in the set:
$$v_{d_j} = \frac{\sum_{k=1, k \neq j}^{N} Sim(\vec{d_k}, \vec{d_j})}{N-1} \quad (4)$$

In Eq. (4), \vec{d_k} and \vec{d_j} are the Euclidean vectors of the kth and jth documents in the retrieved set, Sim is the cosine function evaluating the similarity between \vec{d_k} and \vec{d_j}, and N is the number of selected top documents from the initial retrieved documents.
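Eq. (4) can be sketched in a few lines. The following is a minimal illustration; the function names and the sparse dict representation of TF-IDF document vectors are our own, not from the paper:

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse term->weight vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def document_weights(docs):
    # Eq. (4): v_dj is the average cosine similarity of d_j
    # to the other N-1 top-ranked documents.
    n = len(docs)
    return [
        sum(cosine(docs[k], docs[j]) for k in range(n) if k != j) / (n - 1)
        for j in range(n)
    ]
```

For example, with docs = [{"a": 1.0}, {"a": 1.0}, {"b": 1.0}], the two identical documents each receive v = 0.5 while the outlier receives 0, matching the intuition that central documents are weighted higher.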
Next, combining Eqs. (2)–(4), we obtain Eq. (5):

$$W_{q_i} = \sum_{j=1}^{N} F_{q_i}^{d_j} \times IDF_{q_i} \times v_{d_j} \quad (5)$$
Finally, we use log normalization to smooth the calculated values:

$$W_{q_i} = \log\left(1 + \sum_{j=1}^{N} F_{q_i}^{d_j} \times IDF_{q_i} \times v_{d_j}\right) \quad (6)$$

The constant one is added in Eq. (6) to avoid taking the logarithm of zero. Eq. (6) can be used to re-weigh query terms. W_{q_i} in this equation is proportional to the frequency of the query term in the top documents and to its IDF in the whole collection. Moreover, through v_{d_j}, it is sensitive to the documents that have the highest similarity to the other top documents.
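Given the document weights of Eq. (4), the term weight of Eq. (6) is a one-liner; a minimal sketch with hypothetical argument names:

```python
import math

def ds_term_weight(freqs, idf_qi, v):
    # Eq. (6): W_qi = log(1 + sum_j F^dj_qi * IDF_qi * v_dj), where
    # freqs[j] is the term's frequency in top document j, idf_qi its
    # collection IDF, and v[j] the document weight from Eq. (4).
    return math.log(1.0 + sum(f * idf_qi * vj for f, vj in zip(freqs, v)))
```

Note that a term absent from all top documents yields log(1) = 0, so it contributes nothing after re-weighting.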
3.2. Query based selection
Eq. (4) assumes that the optimal point for the weight of documents is the point with the minimum distance from all the top documents. Thus the closer a document is to the center of the cluster, the higher its weight. This is a simplifying universal assumption, made regardless of the actual behavior of the queries. For example, vague queries carry more than one context in their results; therefore, their result set can contain several clusters of documents. A vague query such as "Frank Sinatra in L.A." may retrieve documents about his personal life in L.A., his concerts in L.A., or even his song about L.A. This is the drawback of the above assumption: it could promote documents with a more general vocabulary in our method.
To tackle this issue, we assign a higher weight to the documents which have a higher similarity to the user’s original query.
Therefore, documents in different clusters can get a high weight only if their topic is close to the topic of the original query. Based
on this intuition, Eq. (4) can be modified as below:
$$v_{d_j} = K \times \frac{\sum_{k=1, k \neq j}^{N} Sim(\vec{d_k}, \vec{d_j})}{N-1} + (1-K) \times Sim(\vec{d_j}, \vec{Q})^{L} \quad (7)$$

In Eq. (7), \vec{Q} denotes the Euclidean vector of the original query, and the variables K and L are constants that should be tuned through experiments. Now we can replace Eq. (4) with Eq. (7) in Eq. (6) as below:
$$W_{q_i} = \log\left(1 + IDF_{q_i} \times \sum_{j=1}^{N} F_{q_i}^{d_j} \times \left(K \times \frac{\sum_{k=1, k \neq j}^{N} Sim(\vec{d_k}, \vec{d_j})}{N-1} + (1-K) \times Sim(\vec{d_j}, \vec{Q})^{L}\right)\right) \quad (8)$$

In Eq. (8), the relation between q_i and d_j is counted twice:
1. When we multiply the term F_{q_i}^{d_j} \times IDF_{q_i} by v_{d_j}.
2. When we use the term Sim(\vec{d_j}, \vec{Q}).
To reduce the effect of this double counting, we first define Q_i as follows:

$$Q_i = Q - \{q_i\} \quad (9)$$

That is, Q_i is the query Q with q_i omitted. Then, in Eq. (8), we replace Q with Q_i in order to reduce the number of times this relation is used. Thus we have:
$$W_{q_i} = \log\left(1 + IDF_{q_i} \times \sum_{j=1}^{N} F_{q_i}^{d_j} \times \left(K \times \frac{\sum_{k=1, k \neq j}^{N} Sim(\vec{d_k}, \vec{d_j})}{N-1} + (1-K) \times Sim(\vec{d_j}, \vec{Q_i})^{L}\right)\right) \quad (10)$$
Finally, in order to have a fixed range of weighting values between 0 and 1, we normalize the final weights of the terms in each query by W_max, the maximum weight of the terms in that query. In practice, we used the normalized values.
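Putting Eqs. (7)–(10) together, the refined weighting can be sketched as follows. This is a sketch under our own naming; the sparse dict representation and the defaults K = 0.7 and L = 2 are placeholders, since the paper tunes both on held-out queries:

```python
import math

def cosine(u, v):
    # Cosine similarity between sparse term->weight vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def refined_term_weights(query, docs, idf, K=0.7, L=2):
    # Eq. (10): each document's contribution to a term's weight mixes
    # its centrality among the top docs (Eq. (4)) with its similarity
    # to Q_i, the query without the term being weighted (Eq. (9)).
    n = len(docs)
    weights = {}
    for qi in query:
        q_i = {t: 1.0 for t in query if t != qi}  # Eq. (9)
        total = 0.0
        for j, dj in enumerate(docs):
            centrality = sum(cosine(docs[k], dj)
                             for k in range(n) if k != j) / (n - 1)
            v_dj = K * centrality + (1 - K) * cosine(dj, q_i) ** L  # Eq. (7)
            total += dj.get(qi, 0.0) * v_dj  # F^dj_qi * v_dj
        weights[qi] = math.log(1.0 + idf.get(qi, 0.0) * total)
    w_max = max(weights.values())  # normalize to [0, 1] by the max weight
    return {t: w / w_max if w_max else w for t, w in weights.items()}
```

The most important term in the query receives weight 1.0, and the others are scaled relative to it, which matches the normalization by W_max described above.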
Eq. (10) is a simple formula that uses well-known definitions such as TF-IDF and cosine similarity. However, what makes it effective, as we will see in Section 4, is the arrangement of its components. First, the content of each top document is emphasized through the multiplication of F_{q_i}^{d_j} and v_{d_j}. Thus the documents that cover more of the query context are favored over those that only partially match it. Since the query re-weighting uses only the weights of the top documents, out-of-context or noisy documents have less chance of diluting the weighting of the query terms. This factor becomes even more important in real-world situations, where it can dampen the effect of spam documents. The second characteristic of this model is dampening the effect of the presence of a single query term in the retrieved documents. That is achieved through the use of Q_i; by measuring the similarity between \vec{Q_i} and \vec{d_j}, the equation ensures that the similarity between the document and the query is not achieved merely through the presence of q_i in d_j. Otherwise, documents that use q_i frequently but lack consistent use of the other query terms could contribute to the weight of q_i more than they should.
4. Results
We have evaluated our method on English and Persian data sets. For English, we used the FIRE (Majumder et al., 2010) corpus; the last version of this data set was published in 2011. For Persian, we used versions one and two of a standard data set named Hamshahri (AleAhmad, Amiri, Darrudi, Rahgozar, & Oroumchian, 2009).1,2 Persian is an Indo-European language and one of the dominant languages in the Middle East; it is primarily spoken in Iran, Tajikistan, and Afghanistan. In this section, we first explain our experimental setup and then report the results.
1http://ece.ut.ac.ir/dbrg/hamshahri/index.html.
2http://www.hamshahrionline.ir/,http://en.wikipedia.org/wiki/Hamshahri.
Table 1
Attributes of the data sets.
Attribute FIRE Hamshahri 1 Hamshahri 2
Collection size 0.99 GB 599 MB 1.43 GB
Encoding ASCII UTF-8 UTF-8
No. of documents 379,820 166,774 318,517
No. of unique terms 525,263 493,537 680,653
Average length of documents 290 terms 238 terms 283 terms
Average length of queries 3.4 3.1 3.5
No. of queries 50 100 50
Fig. 1. Distribution of documents in 9 major categories of Hamshahri collections.
4.1. Experimental setup
The aim of the Forum for Information Retrieval Evaluation (FIRE)3 is to create an evaluation framework like TREC, CLEF, and NTCIR. We used the last edition of their corpus and its queries (queries 126–175), published in 2011. For Persian, we used two versions of the Hamshahri standard data set to evaluate our method: Hamshahri 1 (AleAhmad et al., 2009), which contains the news articles of the Hamshahri newspaper2 from 1996 to 2003, and Hamshahri 2, which includes the news articles of this newspaper from 1996 to 2007. Table 1 summarizes some attributes of these collections. It can be seen that the average length of the queries in all data sets is about 3 terms. Technically, what makes long queries different from short queries is that short queries may not contain sufficient context for disambiguating the query terms; therefore, the information need of the user may not be easily understood.
Figs. 1 and 2 show the categories of the Hamshahri data sets and the distribution of their documents and queries over these categories, respectively. For detailed information about the FIRE data set, the reader is referred to Majumder et al. (2010).
We used Lucene4 4.8.1 for indexing and retrieval. The Porter stemmer is used for stemming both English documents and queries. Due to the lack of a good stemmer for Persian, we did not perform any stemming on the Persian data sets. For stop word removal, we used the standard INQUERY (Allan et al., 2000) stop word list for the FIRE data set, and a list of 774 Persian common words5 for the Hamshahri data sets. For query term re-weighting, we used the default approach of Lucene (called boosting) (Apache Software Foundation), which multiplies the final contribution of each query term to the score of a document by the weight assigned to that query term. We also used the R6 tool for testing the significance of the difference between our method and the others.
We chose a language modeling approach similar to Zhai and Lafferty (2001) with Jelinek–Mercer smoothing as our base
model for comparison purposes. In this model, the documents are ranked by their probability of generating the query. Currently,
this model is one of the best retrieval models. Improving the performance over this model is quite challenging. Jelinek–Mercer
smoothing is a variation of language modeling that improves the performance of language modeling for queries with infrequent
3http://www.isical.ac.in/∼fire/.
4http://lucene.apache.org/.
5http://ece.ut.ac.ir/dbrg/hamshahri/download.html.
6http://www.r-project.org/.
Fig. 2. Distribution of queries over the categories in Hamshahri collections.
terms. Those are the terms that may not appear a sufficient number of times in the training sample set used for estimating the initial probabilities in the model. This method uses a linear interpolation technique to smooth the maximum likelihood document models using a coefficient λ as follows:
$$P(t_i \mid M_j) = (1-\lambda)\frac{f_{i,j}}{\sum_k f_{k,j}} + \lambda\frac{F_i}{\sum_k F_k} \quad (11)$$

In Eq. (11), M_j denotes the language model of the document d_j in the collection, f_{i,j} denotes the frequency of the term t_i in d_j, and F_i denotes the frequency of t_i in the whole collection.
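A minimal sketch of Eq. (11); the argument names are ours:

```python
def jm_probability(tf, doc_len, cf, coll_len, lam=0.2):
    # Eq. (11): Jelinek-Mercer smoothing, a linear interpolation of the
    # document maximum-likelihood estimate (tf / doc_len) with the
    # collection model (cf / coll_len). lam = 0.2 matches the value
    # tuned in Section 4.1.
    return (1.0 - lam) * (tf / doc_len) + lam * (cf / coll_len)
```

A term unseen in the document still receives a non-zero probability from the collection model: jm_probability(0, 100, 50, 10000) gives 0.2 × 0.005 = 0.001.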
In order to compare our method with another re-weighting model, we have implemented the Weighted Information Gain
(WIG) method described in Zhou and Croft (2007) to re-weigh query terms. For a given query term, WIG measures the change
in information about the quality of retrieval from a state that only an average document is retrieved to a state that the actual
results are retrieved. Zhou and Croft (2007) hypothesize that WIG is positively correlated with retrieval effectiveness, because
high quality retrieval should be more effective than returning an average document. Therefore, we expect the WIG method to
assign a higher weight to the more important query terms. Bendersky and Croft (2008) have reported their experiments for
discovering key concepts in verbose queries using WIG along with other common measures (like TF and IDF). Their experiments
show that WIG is one of the most effective methods for concept re-weighting. We used normalized WIG in our experiments
which is defined as below:

$$wig(q_i) = \frac{\frac{1}{N}\sum_{d \in T_N(q_i)} \log p(q_i \mid d) - \log p(q_i \mid C)}{-\log p(q_i \mid C)} \quad (12)$$

In Eq. (12), wig(q_i) denotes the weight assigned to the query term q_i, T_N(q_i) denotes the set of top documents retrieved in response to the query term q_i, N is the number of selected top documents, p(q_i | d) is the maximum likelihood estimate calculated using Eq. (11), and p(q_i | C) is calculated as below:
$$p(q_i \mid C) = \frac{F_i}{\sum_j F_j} \quad (13)$$

In Eq. (13), F_i is the frequency of the term q_i in the whole collection. We believe that comparing our method with both a language modeling baseline and a query re-weighting method enables us to better understand its general performance.7
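The normalized WIG of Eq. (12) can be sketched as below; the naming is ours, and the log-probabilities would come from the smoothed model of Eq. (11):

```python
import math

def wig(log_p_top, p_qi_coll):
    # Eq. (12): average log p(qi|d) over the N top documents, shifted
    # and normalized by -log p(qi|C). Values near 1 mean the top
    # documents generate the term far more readily than the collection
    # average; 0 means they are no better than an average document.
    n = len(log_p_top)
    log_p_c = math.log(p_qi_coll)
    return (sum(log_p_top) / n - log_p_c) / (-log_p_c)
```

When p(q_i | d) equals the collection probability in every top document, the score is 0; when every top document generates the term with probability 1, the score is 1.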
For the Hamshahri 1 collection, we divided the queries into two sets: the first 50 queries were used for learning and estimating the parameters, and the second 50 queries were used for the evaluation. For the FIRE and Hamshahri 2 data sets, however, we used standard 10-fold cross validation; thus in each step we used 90% of the queries for training and 10% for testing.
In the training procedure, we used the MAP criterion to find the best parameter setting. For Jelinek–Mercer smoothing, the value of λ was optimal at 0.2; we used this value for the retrieval process and the WIG re-weighting approach. Moreover, we experimented with different retrieved set sizes (N = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}) in Eq. (12). In Eq. (10), there are three parameters
7We also implemented Robertson’s probabilistic model with query term re-weighting. Due to lack of any significant improvements over the baseline, we did
not report the results here.
Table 2
Evaluation results for DS weighting on the FIRE data set. α and β indicate statistically significant improvements over language modeling and WIG weighting, respectively.
FIRE
Model MAP P@10 R-precision
Language modeling 0.2503 0.362 0.2898
WIG weighting 0.2476 0.356 0.2855
DS weighting 0.2684αβ 0.368 0.2996β
Fig. 3. Retrieval performance for language modeling, WIG term re-weighting and DS term re-weighting on FIRE data set.
Table 3
Evaluation results for DS weighting on the Hamshahri data sets. α and β indicate statistically significant improvements over language modeling and WIG term re-weighting, respectively.
Hamshahri 1 Hamshahri 2
Model MAP P@10 R-precision MAP P@10 R-precision
Language modeling 0.3339 0.556 0.3676 0.3958 0.628 0.4231
WIG weighting 0.3387 0.562 0.3705 0.4053 0.636 0.427
DS weighting 0.3577αβ 0.588 0.3818α 0.4293αβ 0.65 0.4519αβ
which must be estimated: N, K, and L. We have experimented with the following values and their combinations for the three parameters: N: {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}, K: {0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, and L: {1, 2, 3, 4, 5}.
4.2. Experimental results in English language
Table 2 shows the performance of our approach in comparison with WIG term re-weighting approach and simple language
modeling on FIRE data set. All three methods use the same language modeling for the retrieval of documents in the first phase.
However, WIG and our method (DS8) use a set of top documents to re-weigh the query terms. Both our method and WIG use the
re-weighted query to retrieve the final result set.
The achieved results indicate that our method improves retrieval performance, in terms of MAP, by up to 7.23% over language modeling and up to 8.4% over WIG term re-weighting; the improvement is statistically significant under a paired t-test at p < 0.05. Fig. 3 plots the precision–recall curves for the same three models as in Table 2.
4.3. Experimental results in Persian language
Table 3 shows the performance of our approach (DS) in comparison with WIG term re-weighting approach and simple lan-
guage modeling on Hamshahri data sets. We can observe that query term re-weighting using our approach improves retrieval
8Document Similarity.
Fig. 4. Retrieval performance for language modeling, WIG term re-weighting and DS term re-weighting in Hamshahri 1.
Fig. 5. Retrieval performance for language modeling, WIG term re-weighting and DS term re-weighting in Hamshahri 2.
performance, in terms of MAP, by up to 7.12% over language modeling and up to 5.6% over WIG term re-weighting on the Hamshahri 1 data set. Furthermore, the improvements are higher on the Hamshahri 2 data set: up to 8.45% over language modeling and up to 5.92% over WIG term re-weighting.
Figs. 4 and 5 present the precision–recall curves for the language modeling, WIG, and DS term re-weighting approaches on Hamshahri 1 and Hamshahri 2, respectively.
Table 4 provides a query-by-query comparison of precision results for DS term re-weighting, language modeling, and WIG term re-weighting on the Hamshahri 1 test collection. The queries are sorted by their improvement over language modeling, from high to low. We observe that our method improves the performance of 66% of the queries over language modeling. Moreover, the improved queries range from queries with low performance (such as query numbers 3 and 50) to queries with high performance (such as query numbers 7 and 15). We have categorized the queries into two sets: specific or broad. Although some of the broad queries also improved, most of the improvements come from specific queries. This phenomenon can be explained by the nature of the broad queries and the fact that these queries are short and lack discriminative keywords.
Table 5 shows a number of queries from the Hamshahri 1 data set and their results. The weight of each term is shown in brackets. Columns 3 and 4 show the performance of each query under language modeling and DS term re-weighting. Note that the Persian equivalents of some English words (like "copyright" or "rationing") consist of two parts; their weights are listed, respectively. Moreover, the word "Yugoslavia" in query number 6 has zero weight. This word has two spellings in Persian, so there is a spelling mismatch between the form used in query 6 and what is in Hamshahri
Table 4
DS weighting query improvements in comparison to language modeling and WIG weighting on Hamshahri 1. The rows are sorted by improvement over language modeling (over LM %).
Query no. Length Category LM MAP WIG MAP DS MAP Over LM % Over WIG %
3 4 Specific 0.0673 0.084 0.1852 175.18 120.47
50 3 Broad 0.0793 0.1132 0.1612 103.27 42.40
28 4 Broad 0.1221 0.1515 0.2077 70.10 37.09
7 3 Specific 0.423 0.4507 0.659 55.79 46.21
46 3 Specific 0.1642 0.1657 0.2313 40.86 39.58
43 3 Specific 0.15 0.1717 0.2034 35.6 18.46
42 3 Specific 0.3926 0.3993 0.5226 33.11 30.87
41 5 Specific 0.1906 0.1964 0.2535 33.00 29.07
29 3 Broad 0.4537 0.4683 0.5734 26.38 22.44
10 4 Specific 0.3735 0.3742 0.4673 25.11 24.87
30 3 Broad 0.1593 0.1375 0.1986 24.67 44.43
31 4 Specific 0.1761 0.1762 0.2145 21.80 21.73
15 4 Specific 0.4539 0.4535 0.5444 19.93 20.04
38 4 Specific 0.1583 0.1605 0.1761 11.24 9.71
48 3 Specific 0.1477 0.1476 0.162 9.68 9.75
9 4 Specific 0.2729 0.2711 0.2979 9.16 9.88
12 2 Broad 0.2475 0.2545 0.2635 6.46 3.53
14 3 Specific 0.2745 0.2723 0.2915 6.19 7.05
23 4 Specific 0.4684 0.4737 0.4914 4.91 3.73
2 3 Specific 0.6079 0.6213 0.6342 4.32 2.07
32 3 Specific 0.3352 0.3497 0.3479 3.78 −0.51
6 4 Specific 0.6103 0.6053 0.6314 3.45 4.31
25 4 Specific 0.1348 0.1462 0.1393 3.33 −4.71
49 4 Specific 0.0813 0.1128 0.0837 2.95 −25.79
45 4 Broad 0.1816 0.1963 0.1864 2.64 −5.04
26 3 Broad 0.3604 0.3646 0.3697 2.58 1.39
4 2 Broad 0.3853 0.3898 0.3918 1.68 0.51
27 2 Broad 0.6171 0.6181 0.6226 0.89 0.72
8 2 Specific 0.5259 0.5239 0.5292 0.62 1.01
16 3 Broad 0.8089 0.8123 0.8114 0.30 –0.11
19 4 Specific 0.2895 0.2902 0.29 0.17 −0.06
5 4 Specific 0.958 0.958 0.9596 0.16 0.16
36 2 Broad 0.163 0.1601 0.1632 0.12 1.93
44 2 Broad 0.5607 0.5609 0.5607 0.00 −0.03
39 3 Broad 0.9101 0.9105 0.9098 −0.03 −0.07
35 5 Specific 0.1132 0.1144 0.1126 −0.53 −1.57
1 3 Specific 0.1417 0.142 0.1406 −0.77 −0.98
13 2 Broad 0.4916 0.4893 0.4873 −0.87 −0.40
17 2 Broad 0.4458 0.4546 0.4417 −0.91 −2.83
47 3 Specific 0.562 0.561 0.5554 −1.17 −0.99
24 4 Specific 0.1417 0.1422 0.14 −1.19 −1.54
20 4 Specific 0.5005 0.5009 0.4895 −2.19 −2.27
34 3 Specific 0.4904 0.4872 0.4776 −2.61 −1.97
18 3 Broad 0.1464 0.1454 0.1411 −3.62 −2.95
37 4 Specific 0.3228 0.3189 0.306 −5.20 −4.04
33 6 Specific 0.1932 0.2037 0.1804 −6.62 −11.43
40 3 Broad 0.1551 0.1478 0.1383 −10.83 −6.42
11 3 Specific 0.4322 0.4301 0.3838 −11.19 −10.76
22 3 Specific 0.1792 0.1804 0.1529 −14.67 −15.24
21 3 Specific 0.0757 0.0734 0.005 −93.39 −93.18
1 collection. Table 5 indicates that even for short queries it is possible to improve performance by assigning higher weights to the more important query terms, and our method is partially successful in accomplishing this task. However, there are some cases, such as query numbers 9 and 10, which do not contain a clear keyword among their terms; these are queries that our method cannot improve and for which it may even cause query drift.
4.4. Discussion
There are two main factors which play a central role in the performance of our approach:
1. The presence of keywords in the user's original query; that is, there must be at least one term in the query that carries more information than the other terms.
2. The number of relevant documents that are retrieved in response to the query in the first cycle.
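The interplay of these two factors can be illustrated with a toy sketch. The paper's exact weighting formulas (including Eq. (4)) are not reproduced in this excerpt; the function below simply assumes each top document already carries a weight reflecting its closeness to the information need, scores each query term by its weighted relative frequency across those documents, and normalizes so the strongest term gets weight 1, matching the bracketed weights reported in Table 5.

```python
from collections import Counter

def reweight_query_terms(query_terms, top_docs, doc_weights):
    """Sketch of document-weighted query term re-weighting.

    top_docs: list of token lists from the first retrieval cycle.
    doc_weights: one weight per document, assumed to reflect its
    closeness to the user's information need (computed elsewhere).
    """
    scores = {}
    for term in query_terms:
        s = 0.0
        for doc, w in zip(top_docs, doc_weights):
            # Weighted relative frequency of the term in this document.
            s += w * Counter(doc)[term] / max(len(doc), 1)
        scores[term] = s
    # Normalize so the most informative term gets weight 1.
    peak = max(scores.values()) or 1.0
    return {t: round(s / peak, 2) for t, s in scores.items()}
```

With top documents dominated by a discriminative term (factor 1) and heavily weighted relevant documents (factor 2), that term stands out clearly; if either factor is missing, the weights flatten out.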
Table 5
Sample query term weights assigned by the DS weighting method in Hamshahri 1.
No. Query LM MAP DS MAP
1 0.6079 0.6342
(Heart[0.87] Disease[0.83] and Smoking[1])
2 0.423 0.659
(Commemorations[0.31] of Sadi[1] Shirazi[0.26])
3 0.3735 0.4673
(Benefits[0.01] of Copyright[1, 0.98] Laws[0.39])
4 0.0673 0.1852
(Gas[1] Rationing[0.75, 0.42] in Iran[0.62])
5 0.4539 0.5444
(Remembrance[0.46] of Dr[0.77] Ali[0.54] Shariati[1])
6 0.1221 0.2077
(NATO[1] vs. Yugoslavia[0] War[0.49] in 1998[0.05])
7 0.4537 0.5734
(Global[0.21] Drought[1] Crisis[0.73])
8 0.1593 0.1986
(Iranian[0.86] Traditional[0.80] Celebrations[1])
9 0.1551 0.1383
(weave[0.88] rug[0.48, 1])
10 0.0757 0.005
(Television[0.19] and Mental[1] Health[0.96])
Fig. 6. Retrieval performance of DS weighting for different numbers of selected top documents in Hamshahri 1.
In order to measure the robustness of our method, we experimented with the number of documents retrieved in the first phase. The noise (the number of non-relevant documents) is expected to increase as more documents from the first phase are used, and this noise could cause major problems for re-weighting by diluting the frequencies of important terms. In our experiment, we fixed the parameters L and K at their optimal values and evaluated the MAP criterion for different values of N, the number of top documents selected for the re-weighting process. Fig. 6 shows the result of this experiment on Hamshahri 1. We observed that even when the number of selected documents is increased up to 300, our retrieval performance remains better than LM weighting. This experiment shows that our term re-weighting approach is stable against non-relevant documents that enter the top retrieved set.
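A minimal sketch of this robustness sweep follows. The retrieval pipeline itself is not shown; `rankings_by_n` is a hypothetical mapping from each value of N to the final ranking produced when the top N first-phase documents feed the re-weighting step, and `average_precision` is the standard per-query component of MAP.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP: mean of the precision values at each relevant document's rank."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def sweep_top_n(rankings_by_n, relevant_ids):
    """Evaluate one query's AP for each candidate value of N."""
    return {n: average_precision(r, relevant_ids)
            for n, r in rankings_by_n.items()}
```

Averaging the per-query AP values over all queries, for each N, yields the MAP curve plotted in Fig. 6.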
The presence of informative keywords is another important factor influencing the performance of the system. For instance, query 7, "Commemorations of Sadi Shirazi", has a precision of 0.93 at a document cutoff of 15 (P@15). The term "Sadi" (the Iranian poet) conveys more information than the terms "Commemorations" and "Shirazi" (a reference to a city in Iran). As a result, Table 4 shows an improvement in MAP of 55.79% for this query. On the other hand, query 21, "Television and Mental Health", has a precision of 0 at the same cutoff. There is no clear informative keyword in this query to reveal the user's intention, and this dramatically affects its MAP value.
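The P@15 figures quoted here follow the standard precision-at-cutoff definition, which can be sketched as:

```python
def precision_at_k(ranked_ids, relevant_ids, k=15):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k
```

For example, a ranking with 14 relevant documents among its top 15 yields P@15 = 14/15 ≈ 0.93, the value reported for query 7 above.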
Table 6
The optimal parameters of DS weighting in the data sets.
Data set N K L
FIRE 20 0.9 4
Hamshahri 1 70 0.9 3
Hamshahri 2 20 0.9 4.4
Table 7
Evaluation results for DS weighting using query descriptions. α and β indicate statistically significant improvements over language modeling and WIG weighting, respectively.
Data set Model MAP P@10 R-precision
FIRE Language modeling 0.3104 0.466 0.333
WIG weighting 0.317 0.452 0.3484
DS weighting 0.3639αβ 0.504αβ 0.3869αβ
Hamshahri 1 Language modeling 0.285 0.5 0.3310
WIG weighting 0.302 0.51 0.3452
DS weighting 0.3544αβ 0.58αβ 0.3847αβ
Hamshahri 2 Language modeling 0.2846 0.538 0.3226
WIG weighting 0.3011 0.554 0.334
DS weighting 0.365αβ 0.592αβ 0.3954αβ
Table 6 shows the optimal values of the three parameters N, K, and L in the data sets. For the FIRE and Hamshahri 2 data sets, the values are averages of the corresponding parameters over the folds of the cross-validation process. The parameters were mostly similar across the folds; thus, to avoid reporting repeated values, Table 6 only shows the averages. We can observe that there is no fixed value for parameter N (the number of top documents); it varies from one data set to another. On the other hand, the optimal value of parameter K (the coefficient of the similarity of a document to the other top documents) tends to favor documents that are highly similar to the other top documents over those that are more similar to the query, since the higher the value of K, the more influential the value of Eq. (4) becomes in the final weights. We predict this behavior may change in the web environment: in real-world situations, the presence of spam documents in the top list means that overweighting top documents may cause query drift.
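Eq. (4) itself is not shown in this excerpt; purely as a hypothetical illustration of the role of K described above, one can picture a document weight that linearly combines similarity to the query with average similarity to the other top documents.

```python
def document_weight(sim_to_query, sims_to_other_docs, k=0.9):
    """Hypothetical form of the document weighting discussed above.

    k trades off similarity to the other top documents against
    similarity to the query; the paper's actual Eq. (4) is assumed,
    not reproduced. With the optimal k = 0.9 found in Table 6, the
    inter-document similarity term dominates.
    """
    avg_doc_sim = sum(sims_to_other_docs) / len(sims_to_other_docs)
    return k * avg_doc_sim + (1 - k) * sim_to_query
```

Under this form, a document that resembles the other top documents but not the query still outweighs one that matches only the query, which is exactly the behavior the optimal K = 0.9 favors.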
Regarding the execution time of our method: since we run the query against the data set twice (once to retrieve the top documents and once to produce the final results), our method is slower than the baseline (language modeling). However, because we re-formulate the query by re-weighting its original terms, our method is faster than expansion methods that add new terms to the query, since added terms usually reduce retrieval speed.
We also conducted another experiment to measure the effectiveness of our method on longer queries. In each data set, we used the descriptions of the queries instead of their titles. The average lengths of the query descriptions in the FIRE, Hamshahri 1, and Hamshahri 2 data sets are 7.76, 6.67, and 6.46 terms, respectively. Table 7 reports the results of this experiment. They indicate that, on average, the long queries perform worse than their shorter equivalents in the Hamshahri 1 and Hamshahri 2 data sets. This is due to the presence of terms that are not directly related to the users' information need. Our method improves performance by up to 24.35% and 28.25% on the MAP criterion over language modeling in these data sets. On the other hand, the results on the FIRE data set show that the long queries perform better than the shorter ones. Although this signifies that the terms used in the FIRE query descriptions are accurate, our method still manages to improve performance by up to 17.24% on the MAP criterion over language modeling, because it correctly detects the more informative keywords from among all the keywords in the queries.
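These improvement percentages follow directly from the MAP values in Table 7:

```python
# Improvement of DS weighting over language modeling on the MAP
# criterion, computed from the description-query results in Table 7.
def improvement_pct(baseline_map, ds_map):
    return round((ds_map - baseline_map) / baseline_map * 100, 2)

table7 = {  # data set: (LM MAP, DS MAP)
    "FIRE":        (0.3104, 0.3639),
    "Hamshahri 1": (0.2850, 0.3544),
    "Hamshahri 2": (0.2846, 0.3650),
}
gains = {name: improvement_pct(lm, ds) for name, (lm, ds) in table7.items()}
# → {'FIRE': 17.24, 'Hamshahri 1': 24.35, 'Hamshahri 2': 28.25}
```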
5. Conclusions and future work
In this paper, we proposed a straightforward approach to query term re-weighting. Our approach uses the initial query to retrieve a set of documents; it then weights each document based on its closeness to the user's information need. These document weights are used in the recalculation of the query term weights. Our approach improves retrieval performance, in terms of the MAP criterion, by up to 7% over the language modeling approach on three data sets. It also outperforms other query term re-weighting approaches such as the WIG term weighting model. We believe more sophisticated weighting methods can yield even further improvements; therefore, in future work we will look into various probabilistic frameworks to achieve better results.
References
AleAhmad, Abolfazl, Amiri, Hadi, Darrudi, Ehsan, Rahgozar, Masoud, & Oroumchian, Farhad (2009). Hamshahri: A standard Persian text collection. Knowledge-Based Systems, 22(5), 382–387.
AleAhmad, Abolfazl, Hakimian, Parsia, Mahdikhani, Farzad, & Oroumchian, Farhad (2007). N-gram and local context analysis for Persian text retrieval. In Proceedings of the 9th international symposium on signal processing and its applications, ISSPA 2007. IEEE.
Allan, James, Connell, Margaret E., Croft, W. Bruce, Feng, Fang-Fang, Fisher, David, & Li, Xiaoyan (2000). INQUERY and TREC-9. DTIC Document.
Apache Software Foundation. TF-IDF similarity (Lucene 4.8.1 API). Available from: http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html. Accessed 10.08.15.
Bendersky, Michael, & Croft, W. Bruce (2008). Discovering key concepts in verbose queries. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Craswell, Nick, Robertson, Stephen, Zaragoza, Hugo, & Taylor, Michael (2005). Relevance weighting for query independent evidence. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Croft, W. Bruce, Cronen-Townsend, Stephen, & Lavrenko, Victor (2001). Relevance feedback and personalization: A language modeling perspective. In Proceedings of the DELOS workshop: Personalisation and recommender systems in digital libraries.
Hakimian, Parsia, & Taghiyareh, Fattaneh (2008). Customizing local context analysis for Farsi information retrieval by using a new concept weighting algorithm. In Proceedings of the third international workshop on semantic media adaptation and personalization, 2008. SMAP'08. IEEE.
Kang, In-Ho, & Kim, GilChang (2003). Query type classification for web document retrieval. In Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Lavrenko, Victor, & Croft, W. Bruce (2001). Relevance based language models. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Lee, Kyung Soon, Croft, W. Bruce, & Allan, James (2008). A cluster-based resampling method for pseudo-relevance feedback. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Liu, Ziyang, Natarajan, Sivaramakrishnan, & Chen, Yi (2011). Query expansion based on clustered results. Proceedings of the VLDB Endowment, 4(6), 350–361.
Majumder, Prasenjit, Mitra, Mandar, Pal, Dipasree, Bandyopadhyay, Ayan, Maiti, Samaresh, Pal, Sukomal, Modak, Deboshree, & Sanyal, Sucharita (2010). The FIRE 2008 evaluation exercise. ACM Transactions on Asian Language Information Processing (TALIP), 9(3), 10.
Manning, Christopher D., Raghavan, Prabhakar, & Schütze, Hinrich (2008). Introduction to information retrieval: Vol. 1. Cambridge: Cambridge University Press.
Robertson, Stephen E., & Jones, K. Sparck (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129–146.
Saboori, F., Bashiri, H., & Oroumchian, Farhad (2012). Assessment of query reweighing by Rocchio method in Farsi information retrieval. International Journal of Information Science and Management (IJISM), 6(1), 9–16.
Salton, Gerard, Wong, Anita, & Yang, Chung-Shu (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
Sieg, Ahu, Mobasher, Bamshad, & Burke, Robin (2007). Web search personalization with ontological user profiles. In Proceedings of the sixteenth ACM conference on information and knowledge management. ACM.
Xu, Jinxi, & Croft, W. Bruce (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems (TOIS), 18(1), 79–112.
Zaragoza, Hugo, Craswell, Nick, Taylor, Michael J., Saria, Suchi, & Robertson, Stephen E. (2004). Microsoft Cambridge at TREC 13: Web and hard tracks. In Proceedings of the text retrieval conference, TREC.
Zhai, Chengxiang, & Lafferty, John (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Zhou, Yun, & Croft, W. Bruce (2007). Query performance prediction in web search environments. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM.