ARTICLE IN PRESS
JID: IPM [m3Gsc;November 9, 2015;14:6]
Information Processing and Management 000 (2015) 1–12
Contents lists available at ScienceDirect
Information Processing and Management
journal homepage: www.elsevier.com/locate/ipm
A query term re-weighting approach using document similarity
Payam Karisani a, Maseud Rahgozar a, Farhad Oroumchian b
a Database Research Group, Control and Intelligent Processing Center of Excellence, School of Electrical and Computer Engineering, University of Tehran, Iran
b University of Wollongong, Dubai
article info
Article history:
Received 27 May 2015
Revised 19 September 2015
Accepted 23 September 2015
Available online xxx
Keywords:
Text retrieval
Query term re-weighting
Document similarity
Query expansion
abstract
Pseudo-relevance feedback is the basis of a category of automatic query modification techniques. Pseudo-relevance feedback methods assume the initial retrieved set of documents to be relevant. Then they use these documents to extract more relevant terms for the query or just re-weigh the user's original query. In this paper, we propose a straightforward, yet effective use of a pseudo-relevance feedback method in detecting more informative query terms and re-weighting them. The query-by-query analysis of our results indicates that our method is capable of identifying the most important keywords even in short queries. Our main idea is that some of the top documents may contain a closer context to the user's information need than the others. Therefore, re-examining the similarity of those top documents and weighting this set based on their context could help in identifying and re-weighting informative query terms. Our experimental results in standard English and Persian test collections show that our method improves retrieval performance, in terms of the MAP criterion, by up to 7% over traditional query term re-weighting methods.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Traditional computer-based IR concentrates on techniques that improve the performance of retrieval systems. Examples
of such techniques are probabilistic or language modeling (Craswell, Robertson, Zaragoza, & Taylor, 2005; Zaragoza, Craswell,
Taylor, Saria, & Robertson, 2004), personalized search (Croft, Cronen-Townsend, & Lavrenko, 2001; Sieg, Mobasher, & Burke,
2007), query classification (Kang & Kim, 2003), and query modification (Lavrenko & Croft, 2001; Lee, Croft, & Allan, 2008). Query
modification techniques are a group of models that try to improve the retrieval performance by improving the original user
query. There are two main classes of query modification methods. The first class is called query expansion in which the system
reformulates the user query (Lavrenko & Croft, 2001; Lee, Croft, & Allan, 2008) by adding extra terms and re-weighting the query
terms. The second class however, concentrates only on re-weighting the query terms (Bendersky & Croft, 2008; Robertson &
Jones, 1976).
In this paper, we propose an approach to query modification through query term re-weighting. We use automatic feedback
to retrieve the first set of relevant documents, and then we extract the information which is needed for assigning a meaningful
weight to each query term. Our experimental results in English and Persian languages indicate that our method outperforms
traditional query term re-weighting approaches.
The rest of this paper is organized as follows: Section 2 provides an overview of the related studies. Section 3 presents our
approach to query term re-weighting in detail. Section 4 reports our results, i.e., Section 4.1 explains our experimental setup,
Corresponding author. Tel.: +98 2182089718.
E-mail addresses: p.karisani@gmail.com (P. Karisani), rahgozar@ut.ac.ir (M. Rahgozar), oroumchian@acm.org (F. Oroumchian).
http://dx.doi.org/10.1016/j.ipm.2015.09.002
0306-4573/© 2015 Elsevier Ltd. All rights reserved.
Please cite this article as: P. Karisani et al., A query term re-weighting approach using document similarity, Information Pro-
cessing and Management (2015), http://dx.doi.org/10.1016/j.ipm.2015.09.002
Sections 4.2 and 4.3 present our results in English and Persian data sets, and Section 4.4 discusses the method. Finally, Section 5
concludes the paper.
2. Related work
A substantial amount of work has been done (Bendersky & Croft, 2008; Lavrenko & Croft, 2001; Lee, Croft, & Allan, 2008; Robertson & Jones, 1976) in English information retrieval. Several research studies have influenced our work in one way or another. Lee, Croft, and Allan (2008) propose a method based on the local clustering hypothesis. The cluster hypothesis states that a
group of similar documents tend to be relevant to the same query. Using a K-NN method they cluster the top retrieved documents,
and rank the clusters based on the likelihood of generating the query. Then using the relevance model (Lavrenko & Croft, 2001)
they extract the new terms for expansion from the documents which belong to the top clusters. In their method, the documents
which appear in several clusters are called dominant. Their hypothesis is that these documents have a good representation of the
topics of the query. Because they appear multiple times in the clusters, they can contribute more to the expansion process and
improve the precision. Liu, Natarajan, and Chen (2011) use local clustering to propose a novel method for query suggestion. Based
on the number of clusters which exist in the top documents, their goal is to suggest a diversified set of expanded queries to the
user. Their assumption is that this set of queries will cover all the topics which are related to ambiguous user queries. The result
of each query in the set, when run against the collection, should be the corresponding cluster with the highest precision and
recall. They prove that this problem is NP-hard and try to propose two algorithms which predict the queries. While our method
like these methods tries to extract the information which the top documents carry, there are still some differences. First, we do
not add new terms to the query. The information which is extracted is used to re-weigh the original query terms. Second, our
approach to extract the information is different. We do not cluster the top documents; instead, we treat each one as a single
entity which carries information.
One of the first studies on query term re-weighting has been carried out by Robertson and Jones (1976). Their approach is
based on the probabilistic retrieval model. The main idea of the probabilistic model is that there is a set of documents which
exactly contains all the related documents. Using the properties of this set we could retrieve the related documents. Because we
do not have access to the set we try to guess the properties. Thus an initial guess is made about the weights of the query terms
to retrieve the first set of documents. In the next step, using an incidence contingency table over the top documents the weights
of the query terms are refined to retrieve the final set. Here we do not use the probabilistic framework, and we also try to use the
information which the top documents carry in relation to each other. There is no such step in Robertson's model.
Bendersky and Croft (2008) propose a framework to discover key concepts in verbose queries. First, they propose a model
based on language modeling approaches to incorporate concept weights into the retrieval process. Then they define a function
which estimates the membership of terms in the set of related concepts to the query. The normalized version of this function
is used in their retrieval process. To evaluate the value of this function they use a machine learning approach. In their method
concepts are mapped to a feature vector. The values of the vector are several query-dependent and query-independent features.
One of their most effective features is the Weighted Information Gain (Zhou & Croft, 2007), which we discuss in Section 4. Here
we also focus on short queries. Besides, we directly map terms to the corresponding weights because we only use one resource,
which is the top documents.
Recently many studies have been conducted in Persian text retrieval. Saboori, Bashiri, and Oroumchian (2012) investigated
the role of query term re-weighting using vector space model (Salton, Wong, & Yang, 1975). Hakimian and Taghiyareh (2008)
tried optimizing the parameters of Local Context Analysis (Xu & Croft, 2000). The role of N-gram based vector space model and
Local Context Analysis approach has been studied in Aleahmad, Hakimian, Mahdikhani, and Oroumchian (2007).
In this research, we demonstrate that query term re-weighting can be useful even in short queries—those with about three
terms. Furthermore, we propose a straightforward, yet effective method for estimating the importance of query terms. An im-
mediate impact of our work would be achieving a higher performance in document retrieval through emphasizing those terms
in more elaborate weighting schemes.
Our main motivation for this research was the amount of work that has been carried out in this area on verbose queries.
Much research has concentrated on long queries, since it is intuitive to assume that identifying and eliminating less
influential terms in long queries could boost the performance. However, there are not many research studies that specifically
investigate the role of keyword detection in short queries. Therefore, it was felt that such an effort is needed to understand the
contribution of terms in all kinds of queries. Apart from this aim, other requirements of our work are simplicity and
robustness in order to make our methods suitable for real world scenarios. We achieve simplicity by only using attributes that are
readily available at run time. The robustness of our method comes from the fact that we do not rely on a single piece of evidence to
assign our weights; instead, we use several filters and steps to ensure the effectiveness of the process.
3. Proposed term re-weighting method
In this section, we present our term re-weighting method. First, we use the original user’s query to retrieve the initial relevant
documents; then we assign a weight to each relevant document which defines the importance of that document to the user’s
information need. Finally, we modify the weight of each query term based on their occurrence in these weighted documents. Our
method can be categorized as one of the local feedback query modification methods. Local feedback query modification methods
use the context in the documents that are retrieved for a given query in the first phase to reformulate the query. In Sections 3.1
and 3.2 we introduce the calculation of weights for each term in the re-weighting schema.
3.1. Local feedback using document similarity
In the local feedback methods, the main source of extracting information regarding the user’s information need is the initial
retrieved documents from the first phase. For instance, Bendersky and Croft (2008) and Liu, Natarajan and Chen (2011) use the
initial retrieved documents to detect more effective words in order to add them to the user's original query. In our approach,
we use these documents to weigh the user’s original query terms because we believe this will achieve a more effective way
of using the original query terms. Assuming the top retrieved documents to be relevant is a common assumption in pseudo-
relevance feedback methods. Although this assumption carries the danger of query drift (Manning, Raghavan, & Schütze, 2008),
it is reported that using the top retrieved documents, in a controlled way, improves retrieval effectiveness significantly (Lavrenko
& Croft, 2001; Lee, Croft, & Allan, 2008; Robertson & Jones, 1976).
We can represent the original query by the vector Q as Q = {q_1, q_2, ...}, where q_i denotes the ith query term. We also define the final weight of each query term q_i as follows:

W_{q_i} = \sum_{j=1}^{N} W_{q_i}^{d_j}    (1)

In Eq. (1), W_{q_i} denotes the final weight of q_i, N is the number of selected top documents from the initial retrieved documents, and W_{q_i}^{d_j} denotes the weight contributed by document d_j, the jth retrieved document, because of the query term q_i.
Our hypothesis is that W_{q_i}^{d_j} is valuable if d_j is truly relevant to the user's information need; that is, although our retrieval engine retrieved d_j as a relevant document, there is a chance that d_j might not be as relevant as it should be. To address the issue of the importance of the documents in the retrieved document set, we can define W_{q_i}^{d_j}, the adjusted weight of the term q_i in the document d_j, as below:

W_{q_i}^{d_j} = w_{q_i}^{d_j} \times v_{d_j}    (2)

In Eq. (2), w_{q_i}^{d_j} denotes the weight of the term q_i in the document d_j, and v_{d_j} denotes the relevance of d_j to the user's information need.
The standard TF-IDF weighting model could be used for calculating the base weight of the query term q_i in the document d_j (w_{q_i}^{d_j} in Eq. (2)), as below:

w_{q_i}^{d_j} = F_{q_i}^{d_j} \times IDF_{q_i}    (3)

In Eq. (3), F_{q_i}^{d_j} denotes the frequency of q_i in d_j, and IDF_{q_i} denotes the inverse document frequency of q_i in the whole collection. To calculate the importance of the document to the original query, or v_{d_j} in Eq. (2), we assume that the initial retrieved
set. To calculate the importance of the document to the original query or vdj,inEq. (2), we assume that the initial retrieved
documents are a good prediction for the user’s information need. Thus we measure the relevance of each document to the user’s
information need by evaluating the distance of each document to the other documents in the retrieved set. The size of the
retrieved set is defined experimentally. We use the following equation to compute the similarity of each top document to the
other documents in the set:
vdj=N
k=1,k= jSim
dk,
dj
N1(4)
In Eq. (4),
dkand
djare the related Euclidean vectors of kth and jth documents in the retrieved set, Sim is cosine function for
evaluating the similarity between
dkand
dj,andNis the number of selected top documents from the initial retrieved documents.
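Eq. (4) amounts to scoring each top document by its average cosine similarity to the other documents in the retrieved set. A minimal sketch (the function name and the use of dense NumPy vectors are our own illustration, not part of the paper):

```python
import numpy as np

def centrality_weights(doc_vectors):
    """Eq. (4): weight each top document by its average cosine
    similarity to the other documents in the retrieved set."""
    D = np.asarray(doc_vectors, dtype=float)
    # L2-normalize rows so a dot product equals cosine similarity.
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    S = D @ D.T                      # pairwise cosine similarities
    np.fill_diagonal(S, 0.0)         # exclude Sim(d_j, d_j)
    N = D.shape[0]
    return S.sum(axis=1) / (N - 1)   # average over the other N-1 docs
```

A document identical to another top document gets a high weight, while an outlier (orthogonal to the rest) gets a weight near zero.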
Next, we can obtain Eq. (5) using Eqs. (2)–(4):

W_{q_i} = \sum_{j=1}^{N} F_{q_i}^{d_j} \times IDF_{q_i} \times v_{d_j}    (5)

Finally, we use log normalization to smooth the calculated values:

W_{q_i} = \log\left(1 + \sum_{j=1}^{N} F_{q_i}^{d_j} \times IDF_{q_i} \times v_{d_j}\right)    (6)

The constant one is added in Eq. (6) to avoid having zero in the logarithm. Eq. (6) can be used to re-weigh query terms. W_{q_i} in this equation is proportional to the frequency of query terms in the top documents, and to the IDF of query terms in the whole collection. Moreover, it is sensitive to the documents which have the highest similarity to the other top documents, through v_{d_j}.
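Eq. (6) can be sketched as follows. Since IDF_{q_i} does not depend on j, it is factored out of the sum, which is algebraically equivalent; the data-structure layout (term-frequency matrix indexed by document then term) is our own assumption:

```python
import math

def ds_term_weights(tf, idf, v):
    """Eq. (6): smoothed DS weight for each query term.
    tf[j][i] - frequency of query term i in top document j
    idf[i]   - inverse document frequency of term i in the collection
    v[j]     - document weight from Eq. (4)
    """
    n_terms = len(idf)
    weights = []
    for i in range(n_terms):
        # Sum the document-weighted term frequencies over the N top documents.
        s = sum(tf[j][i] * v[j] for j in range(len(v)))
        weights.append(math.log(1.0 + idf[i] * s))
    return weights
```

A term that occurs often in the centrally weighted documents receives a higher weight than one that occurs only in peripheral documents.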
3.2. Query based selection
Eq. (4) assumes that the optimal point for the weight of documents is the point which has the minimum distance from all
the top documents. Thus the closer a document is to the center of the cluster, the higher is the weight of the document. This is a
simplifying universal assumption which is made regardless of the actual behavior of the queries. For example, vague queries carry
more than one context in their results. Therefore, their result set can contain several clusters of documents. A vague query such as
“Frank Sinatra in L.A.” may retrieve documents about his personal life in L.A., his concerts in L.A., or even his song about L.A. This is the drawback of the above assumption: it could potentially promote documents with a more general vocabulary in our method.
To tackle this issue, we assign a higher weight to the documents which have a higher similarity to the user’s original query.
Therefore, documents in different clusters can get a high weight only if their topic is close to the topic of the original query. Based
on this intuition, Eq. (4) can be modified as below:

v_{d_j} = K \times \frac{\sum_{k=1, k \neq j}^{N} Sim(\vec{d_k}, \vec{d_j})}{N-1} + (1-K) \times Sim(\vec{d_j}, \vec{Q})^{L}    (7)

In Eq. (7), \vec{Q} denotes the Euclidean vector of the original query, and the variables K and L are constants that should be tuned through experiments. Now we can replace Eq. (4) with Eq. (7) in Eq. (6) as below:

W_{q_i} = \log\left(1 + IDF_{q_i} \times \sum_{j=1}^{N} F_{q_i}^{d_j} \times \left(K \times \frac{\sum_{k=1, k \neq j}^{N} Sim(\vec{d_k}, \vec{d_j})}{N-1} + (1-K) \times Sim(\vec{d_j}, \vec{Q})^{L}\right)\right)    (8)

In Eq. (8), the relation between q_i and d_j is calculated twice:

1. When we multiply the term F_{q_i}^{d_j} \times IDF_{q_i} by v_{d_j}.
2. When we use the term Sim(\vec{d_j}, \vec{Q}).
To reduce the effect of this relation, first, we define Q_i as follows:

Q_i = Q - \{q_i\}    (9)

So if we omit q_i from Q we will have Q_i. Then, in Eq. (8), we replace Q with Q_i in order to reduce the number of times this relation is used. Thus we have:

W_{q_i} = \log\left(1 + IDF_{q_i} \times \sum_{j=1}^{N} F_{q_i}^{d_j} \times \left(K \times \frac{\sum_{k=1, k \neq j}^{N} Sim(\vec{d_k}, \vec{d_j})}{N-1} + (1-K) \times Sim(\vec{d_j}, \vec{Q_i})^{L}\right)\right)    (10)
Finally, in order to have a fixed range of weighting values between 0 and 1, we normalize the final weights of the terms in each query using W_max, the maximum weight of the terms in that query. In practice, we used the normalized values.
Eq. (10) is a simple formula; besides, this equation uses known definitions like TF-IDF and cosine similarity. However, what makes this equation effective, as we will see in Section 4, is the arrangement of its components. First, the content of each top document is emphasized through the multiplication of F_{q_i}^{d_j} and v_{d_j}. Thus the documents that cover more of the query context will be favored over those that only partially match it. Since the query re-weighting will use only the weight of the top documents, out-of-context or noisy documents will have less chance of diluting the weighting of the query terms. This factor becomes even more important in real world situations, where it can dampen the effect of spam documents. The second characteristic of this model is dampening the effect of the presence of a single query term in the retrieved documents. That is achieved through the use of Q_i; in fact, by measuring the similarity between \vec{Q_i} and \vec{d_j}, this equation ensures that the similarity between the document and the query is not achieved through the presence of q_i in d_j. Otherwise, the documents which frequently use q_i but lack consistent use of the other query terms could contribute to the weight of q_i more than they should.
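Putting Eqs. (4), (7), (9), and (10) together, the whole re-weighting step can be sketched as below. This is a minimal illustration under our own assumptions: dense term vectors for documents and for each reduced query Q_i are assumed to be available, and the helper names are ours, not the paper's:

```python
import math
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ds_weights(doc_vectors, query_vectors, tf, idf, K=0.6, L=1):
    """Eq. (10): DS query term weights with query-based selection.
    doc_vectors      - one vector per top document d_j
    query_vectors[i] - vector of Q_i (the query with term q_i removed)
    tf[j][i], idf[i] - term frequency / inverse document frequency
    K, L             - tuning constants from Eq. (7)
    """
    D = [np.asarray(d, dtype=float) for d in doc_vectors]
    N = len(D)
    # Average similarity of each document to the other top documents (Eq. 4).
    cent = [sum(cosine(D[k], D[j]) for k in range(N) if k != j) / (N - 1)
            for j in range(N)]
    weights = []
    for i, Qi in enumerate(query_vectors):
        s = 0.0
        for j in range(N):
            # Eq. (7) with Q replaced by Q_i (Eq. 9).
            v_dj = K * cent[j] + (1 - K) * cosine(D[j], np.asarray(Qi, float)) ** L
            s += tf[j][i] * v_dj
        weights.append(math.log(1.0 + idf[i] * s))
    # Normalize by the maximum weight so values fall in [0, 1].
    w_max = max(weights)
    return [w / w_max for w in weights] if w_max > 0 else weights
```

The final division by the per-query maximum implements the W_max normalization described above.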
4. Results
We have evaluated our method on English and Persian language data sets. For English, we used the FIRE (Majumder et al., 2010) corpus. The last version of this data set was published in 2011. For Persian, we used versions one and two of a standard data set named Hamshahri (AleAhmad, Amiri, Darrudi, Rahgozar, & Oroumchian, 2009).1,2 Persian is an Indo-European language, and it is one of the dominant languages in the Middle East. It is primarily spoken in Iran, Tajikistan, and Afghanistan. In this section, first, we explain our experimental setup, and then we report the results.
1http://ece.ut.ac.ir/dbrg/hamshahri/index.html.
2http://www.hamshahrionline.ir/, http://en.wikipedia.org/wiki/Hamshahri.
Table 1
Attributes of the data sets.
Attribute FIRE Hamshahri 1 Hamshahri 2
Collection size 0.99 GB 599 MB 1.43 GB
Encoding ASCII UTF-8 UTF-8
No. of documents 379,820 166,774 318,517
No. of unique terms 525,263 493,537 680,653
Average length of documents 290 terms 238 terms 283 terms
Average length of queries 3.4 3.1 3.5
No. of queries 50 100 50
Fig. 1. Distribution of documents in 9 major categories of Hamshahri collections.
4.1. Experimental setup
The aim of the Forum for Information Retrieval Evaluation (FIRE)3 is to create an evaluation framework like TREC, CLEF, and NTCIR. We used the last edition of their corpus and its queries (queries 126–175), which were published in 2011. In Persian
language, we used two versions of the Hamshahri standard data set to evaluate our method: Hamshahri 1 (AleAhmad et al., 2009), which contains the news articles of the Hamshahri newspaper2 from 1996 to 2003, and Hamshahri 2, which includes the
news articles of this newspaper from year 1996 to 2007. Table 1 summarizes some attributes of these collections. It can be seen
that the average length of the queries in all data sets is about 3 terms. Technically, what makes long queries different from short queries is that short queries may not contain sufficient context for the disambiguation of query terms. Therefore, the information need of the user may not easily be understood.
Figs. 1 and 2 show the categories of the Hamshahri data sets, and the distribution of their documents and queries over these categories, respectively. For detailed information about the FIRE data set, the reader is referred to Majumder et al. (2010).
We used Lucene4 4.8.1 for indexing and retrieval. The Porter stemmer is used for stemming both English documents and queries.
Due to the lack of a good stemmer in Persian language, we did not perform any stemming in the Persian data sets. For stop word
removal, we used the standard INQUERY (Allan et al., 2000) stop word list for the FIRE data set, and a list of 774 Persian common words5 for the Hamshahri data sets. For query term re-weighting, we used the default approach of Lucene (called boosting) (Apache Software Foundation), which multiplies the final contribution of each query term to the score of a document by the weight which is assigned to that query term. We also used the R tool6 for testing the significance of the difference between our method and the others.
We chose a language modeling approach similar to Zhai and Lafferty (2001) with Jelinek–Mercer smoothing as our base
model for comparison purposes. In this model, the documents are ranked by their probability of generating the query. Currently,
this model is one of the best retrieval models. Improving the performance over this model is quite challenging. Jelinek–Mercer
smoothing is a variation of language modeling that improves the performance of language modeling for queries with infrequent
3http://www.isical.ac.in/fire/.
4http://lucene.apache.org/.
5http://ece.ut.ac.ir/dbrg/hamshahri/download.html.
6http://www.r-project.org/.
Fig. 2. Distribution of queries over the categories in Hamshahri collections.
terms. Those are the terms that may not appear a sufficient number of times in the training sample set that is used for estimating the initial probabilities in the model. This method uses a linear interpolation technique to smooth the maximum likelihood document models using a coefficient λ as follows:
P(t_i \mid M_j) = (1-\lambda)\frac{f_{i,j}}{\sum_k f_{k,j}} + \lambda \frac{F_i}{\sum_k F_k}    (11)

In Eq. (11), M_j denotes the language model of the document d_j in the collection, f_{i,j} denotes the frequency of the term t_i in d_j, and F_i denotes the frequency of t_i in the whole collection.
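Eq. (11) can be sketched as a small function; the dictionary-based term-count representation is our own assumption for illustration:

```python
def jelinek_mercer(term, doc_tf, doc_len, coll_tf, coll_len, lam=0.2):
    """Eq. (11): Jelinek-Mercer smoothed probability of a term under a
    document language model, linearly interpolating the maximum
    likelihood document estimate with the collection estimate."""
    p_doc = doc_tf.get(term, 0) / doc_len      # f_{i,j} / sum_k f_{k,j}
    p_coll = coll_tf.get(term, 0) / coll_len   # F_i / sum_k F_k
    return (1 - lam) * p_doc + lam * p_coll
```

With λ = 0.2 (the value found optimal in Section 4.1), 80% of the mass comes from the document estimate and 20% from the collection, so terms unseen in a document still receive a nonzero probability as long as they occur in the collection.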
In order to compare our method with another re-weighting model, we have implemented the Weighted Information Gain
(WIG) method described in Zhou and Croft (2007) to re-weigh query terms. For a given query term, WIG measures the change
in information about the quality of retrieval from a state that only an average document is retrieved to a state that the actual
results are retrieved. Zhou and Croft (2007) hypothesize that WIG is positively correlated with retrieval effectiveness, because
high quality retrieval should be more effective than returning an average document. Therefore, we expect the WIG method to
assign a higher weight to the more important query terms. Bendersky and Croft (2008) have reported their experiments for
discovering key concepts in verbose queries using WIG along with other common measures (like TF and IDF). Their experiments
show that WIG is one of the most effective methods for concept re-weighting. We used normalized WIG in our experiments
which is defined as below:

wig(q_i) = \frac{\frac{1}{N}\sum_{d \in T_N(q_i)} \log p(q_i \mid d) - \log p(q_i \mid C)}{\log p(q_i \mid C)}    (12)

In Eq. (12), wig(q_i) denotes the weight which is assigned to the query term q_i, T_N(q_i) denotes the top document set which is retrieved in response to the query term q_i, N is the number of selected top documents, p(q_i|d) is the maximum likelihood estimate which is calculated using Eq. (11), and p(q_i|C) is calculated as below:

p(q_i \mid C) = \frac{F_i}{\sum_j F_j}    (13)

In Eq. (13), F_i is the frequency of the term q_i in the whole collection. We believe comparing our method with both a language modeling and a query re-weighting method enables us to better understand the general performance of our method.7
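The normalized WIG of Eq. (12) reduces to a short computation over the per-document log-probabilities of a term. The sketch below follows Eq. (12) as printed; note that since log-probabilities are negative, the values come out negative, and some presentations normalize by −log p(q_i|C) instead, which flips the sign:

```python
def wig(log_p_qd, log_p_qC):
    """Eq. (12): normalized Weighted Information Gain for one query term.
    log_p_qd - list of log p(q_i|d) over the N top documents T_N(q_i)
    log_p_qC - log p(q_i|C), the collection log-probability of the term
    """
    N = len(log_p_qd)
    # Gain over the "average document" baseline, averaged over the top N.
    gain = sum(log_p_qd) / N - log_p_qC
    return gain / log_p_qC   # normalization by log p(q_i|C), as printed
```

A term whose top documents assign it a much higher probability than the collection does (a discriminative term) produces a large gain, and so a large magnitude after normalization.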
For the Hamshahri 1 collection, we divided the queries into two sets: the first 50 queries were used for learning and estimating the parameters, and the second 50 queries were used for the evaluation. In the FIRE and Hamshahri 2 data sets, however, we used standard 10-fold cross validation for evaluation. Thus in each step we used 90% of the queries for the training procedure, and 10% for the test procedure.
In the training procedure, we used the MAP criterion to find the best parameter setting. For Jelinek–Mercer smoothing, the value of λ was optimal at 0.2; we used this value for the retrieval process and the WIG re-weighting approach. Moreover, we experimented with different retrieved set sizes (N = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}) in Eq. (12). In Eq. (10), there are three parameters
7 We also implemented Robertson's probabilistic model with query term re-weighting. Due to the lack of any significant improvements over the baseline, we did not report the results here.
Table 2
Evaluation results for DS weighting on the FIRE data set. α and β indicate statistically significant improvements over language modeling and WIG weighting, respectively.

FIRE
Model               MAP       P@10   R-precision
Language modeling   0.2503    0.362  0.2898
WIG weighting       0.2476    0.356  0.2855
DS weighting        0.2684αβ  0.368  0.2996β
Fig. 3. Retrieval performance for language modeling, WIG term re-weighting and DS term re-weighting on FIRE data set.
Table 3
Evaluation results for DS weighting in the Hamshahri data sets. α and β indicate statistically significant improvements over language modeling and WIG term re-weighting, respectively.

                    Hamshahri 1                    Hamshahri 2
Model               MAP       P@10   R-precision   MAP       P@10   R-precision
Language modeling   0.3339    0.556  0.3676        0.3958    0.628  0.4231
WIG weighting       0.3387    0.562  0.3705        0.4053    0.636  0.427
DS weighting        0.3577αβ  0.588  0.3818α       0.4293αβ  0.65   0.4519αβ
which must be estimated: N, K, and L. We have experimented with the following values and their combinations for the three parameters: N: {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}, K: {0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, and L: {1, 2, 3, 4, 5}.
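The parameter tuning described above amounts to an exhaustive grid search over the listed values, scored by MAP on the training queries. A sketch (the `evaluate_map` callback, which would run retrieval with a given setting and return the MAP score, is a hypothetical stand-in for the actual training pipeline):

```python
from itertools import product

def tune_parameters(evaluate_map):
    """Grid search over the parameter values listed in Section 4.1.
    evaluate_map(N, K, L) is assumed to run DS re-weighted retrieval
    on the training queries and return the MAP score."""
    grid_N = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    grid_K = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    grid_L = [1, 2, 3, 4, 5]
    best, best_map = None, -1.0
    for N, K, L in product(grid_N, grid_K, grid_L):
        score = evaluate_map(N, K, L)
        if score > best_map:
            best, best_map = (N, K, L), score
    return best, best_map
```

With 10 × 6 × 5 = 300 combinations, the search is small enough to run exhaustively for each training fold.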
4.2. Experimental results in English language
Table 2 shows the performance of our approach in comparison with the WIG term re-weighting approach and simple language modeling on the FIRE data set. All three methods use the same language modeling for the retrieval of documents in the first phase. However, WIG and our method (DS8) use a set of top documents to re-weigh the query terms. Both our method and WIG use the re-weighted query to retrieve the final result set.
The achieved results indicate that our method improves the retrieval performance, in terms of MAP, by up to 7.23% over language modeling, and up to 8.4% over WIG term re-weighting, which is significant using a paired t-test at p < 0.05. Fig. 3 plots the precision–recall curves for the same three models as in Table 2.
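The reported percentages follow directly from the MAP values in Table 2, as a quick arithmetic check shows:

```python
def rel_improvement(new, base):
    """Relative MAP improvement in percent: 100 * (new - base) / base."""
    return 100.0 * (new - base) / base

# MAP values from Table 2 (FIRE): DS 0.2684, LM 0.2503, WIG 0.2476
over_lm = rel_improvement(0.2684, 0.2503)    # about 7.23%
over_wig = rel_improvement(0.2684, 0.2476)   # about 8.40%
```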
4.3. Experimental results in Persian language
Table 3 shows the performance of our approach (DS) in comparison with the WIG term re-weighting approach and simple language modeling on the Hamshahri data sets. We can observe that query term re-weighting using our approach improves retrieval
8Document Similarity.
performance, in terms of MAP, by up to 7.12% over language modeling, and up to 5.6% over WIG term re-weighting on the Hamshahri 1 data set. Furthermore, the improvements are higher on the Hamshahri 2 data set: up to 8.45% over language modeling, and up to 5.92% over WIG term re-weighting.

Fig. 4. Retrieval performance for language modeling, WIG term re-weighting and DS term re-weighting in Hamshahri 1.
Fig. 5. Retrieval performance for language modeling, WIG term re-weighting and DS term re-weighting in Hamshahri 2.
Figs. 4 and 5 present the precision–recall curves for language modeling, WIG, and DS term re-weighting approaches on Hamshahri 1 and Hamshahri 2, respectively.
Table 4 provides a query-by-query comparison of precision results for DS term re-weighting, language modeling, and WIG term re-weighting on the Hamshahri 1 test collection. The queries are sorted by their improvement over language modeling, from high to low. We observe that our method improves the performance of 66% of the queries over language modeling. Moreover, we can see that the improved queries range from queries with low performance (like query numbers 3 and 50) to queries with high performance (like query numbers 7 and 15). We have categorized the queries into two sets: specific or broad. Although some of the broad queries also improved, most of the improvements come from specific-type queries. This phenomenon could be explained by the nature of these broad-type queries and the fact that they are short and lack discriminative keywords.
Table 5 shows a number of queries from the Hamshahri 1 data set and their relative results. The weight of each term is shown in brackets. Columns 3 and 4 show the performance of each query in language modeling and DS term re-weighting. Note that the equivalents of some English words in Persian (like “copyright” or “rationing”) have two parts; their weights are listed, respectively. Besides, the word “Yugoslavia” in query number 6 has a weight of zero. This word has two spellings in Persian, so there is a spelling mismatch between the form which is used in query 6 and what is in Hamshahri
Table 4
DS weighting query improvements in comparison to language modeling and WIG weighting in Hamshahri 1.
The rows are sorted by improvement over language modeling (Over LM %).
Query no.  Length  Category  LM MAP  WIG MAP  DS MAP  Over LM %  Over WIG %
3          4       Specific  0.0673  0.084    0.1852  175.18     120.47
50         3       Broad     0.0793  0.1132   0.1612  103.27     42.40
28         4       Broad     0.1221  0.1515   0.2077  70.10      37.09
7          3       Specific  0.423   0.4507   0.659   55.79      46.21
46         3       Specific  0.1642  0.1657   0.2313  40.86      39.58
43         3       Specific  0.15    0.1717   0.2034  35.60      18.46
42         3       Specific  0.3926  0.3993   0.5226  33.11      30.87
41         5       Specific  0.1906  0.1964   0.2535  33.00      29.07
29         3       Broad     0.4537  0.4683   0.5734  26.38      22.44
10         4       Specific  0.3735  0.3742   0.4673  25.11      24.87
30         3       Broad     0.1593  0.1375   0.1986  24.67      44.43
31         4       Specific  0.1761  0.1762   0.2145  21.80      21.73
15         4       Specific  0.4539  0.4535   0.5444  19.93      20.04
38         4       Specific  0.1583  0.1605   0.1761  11.24      9.71
48         3       Specific  0.1477  0.1476   0.162   9.68       9.75
9          4       Specific  0.2729  0.2711   0.2979  9.16       9.88
12         2       Broad     0.2475  0.2545   0.2635  6.46       3.53
14         3       Specific  0.2745  0.2723   0.2915  6.19       7.05
23         4       Specific  0.4684  0.4737   0.4914  4.91       3.73
2          3       Specific  0.6079  0.6213   0.6342  4.32       2.07
32         3       Specific  0.3352  0.3497   0.3479  3.78       -0.51
6          4       Specific  0.6103  0.6053   0.6314  3.45       4.31
25         4       Specific  0.1348  0.1462   0.1393  3.33       -4.71
49         4       Specific  0.0813  0.1128   0.0837  2.95       -25.79
45         4       Broad     0.1816  0.1963   0.1864  2.64       -5.04
26         3       Broad     0.3604  0.3646   0.3697  2.58       1.39
4          2       Broad     0.3853  0.3898   0.3918  1.68       0.51
27         2       Broad     0.6171  0.6181   0.6226  0.89       0.72
8          2       Specific  0.5259  0.5239   0.5292  0.62       1.01
16         3       Broad     0.8089  0.8123   0.8114  0.30       -0.11
19         4       Specific  0.2895  0.2902   0.29    0.17       -0.06
5          4       Specific  0.958   0.958    0.9596  0.16       0.16
36         2       Broad     0.163   0.1601   0.1632  0.12       1.93
44         2       Broad     0.5607  0.5609   0.5607  0.00       -0.03
39         3       Broad     0.9101  0.9105   0.9098  -0.03      -0.07
35         5       Specific  0.1132  0.1144   0.1126  -0.53      -1.57
1          3       Specific  0.1417  0.142    0.1406  -0.77      -0.98
13         2       Broad     0.4916  0.4893   0.4873  -0.87      -0.40
17         2       Broad     0.4458  0.4546   0.4417  -0.91      -2.83
47         3       Specific  0.562   0.561    0.5554  -1.17      -0.99
24         4       Specific  0.1417  0.1422   0.14    -1.19      -1.54
20         4       Specific  0.5005  0.5009   0.4895  -2.19      -2.27
34         3       Specific  0.4904  0.4872   0.4776  -2.61      -1.97
18         3       Broad     0.1464  0.1454   0.1411  -3.62      -2.95
37         4       Specific  0.3228  0.3189   0.306   -5.20      -4.04
33         6       Specific  0.1932  0.2037   0.1804  -6.62      -11.43
40         3       Broad     0.1551  0.1478   0.1383  -10.83     -6.42
11         3       Specific  0.4322  0.4301   0.3838  -11.19     -10.76
22         3       Specific  0.1792  0.1804   0.1529  -14.67     -15.24
21         3       Specific  0.0757  0.0734   0.005   -93.39     -93.18
1 collection. Table 5 indicates that even in short queries it is possible to improve performance by
assigning higher weights to the more important query terms, and our method is partially successful in accomplishing this task.
However, there are some cases, such as query numbers 9 and 10, that do not contain a clear keyword among their terms; these are the
queries that our method cannot improve, and for which it may even cause query drift.
4.4. Discussion
There are two main factors that play a central role in the performance of our approach:
1. The presence of keywords in the user's original query; that is, there must be at least one term in the query that carries more
information than the other terms.
2. The number of relevant documents retrieved in response to the query in the first cycle.
Table 5
Sample query term weights assigned by the DS weighting method in Hamshahri 1.
No.  Query                                                 LM MAP  DS MAP
1    Heart[0.87] Disease[0.83] and Smoking[1]              0.6079  0.6342
2    Commemorations[0.31] of Sadi[1] Shirazi[0.26]         0.423   0.659
3    Benefits[0.01] of Copyright[1, 0.98] Laws[0.39]       0.3735  0.4673
4    Gas[1] Rationing[0.75, 0.42] in Iran[0.62]            0.0673  0.1852
5    Remembrance[0.46] of Dr[0.77] Ali[0.54] Shariati[1]   0.4539  0.5444
6    NATO[1] vs. Yugoslavia[0] War[0.49] in 1998[0.05]     0.1221  0.2077
7    Global[0.21] Drought[1] Crisis[0.73]                  0.4537  0.5734
8    Iranian[0.86] Traditional[0.80] Celebrations[1]       0.1593  0.1986
9    weave[0.88] rug[0.48, 1]                              0.1551  0.1383
10   Television[0.19] and Mental[1] Health[0.96]           0.0757  0.005
Fig. 6. Retrieval performance of DS weighting for different numbers of selected top documents in Hamshahri 1.
In order to measure the robustness of our method, we experimented with the number of documents retrieved in the
first phase. The noise (the number of non-relevant documents) is expected to increase as more documents from the first phase
are used. This noise could cause major problems for re-weighting by diluting the frequencies of important terms.
In our experiment, we fixed the parameters L and K at their optimal values and evaluated the MAP criterion for different values of
N, the number of top documents selected for the re-weighting process. Fig. 6 shows the result of this experiment on
Hamshahri 1. We observed that even if we increase the number of selected documents up to 300, our retrieval performance remains
better than that of LM weighting. This experiment shows that our term re-weighting approach is stable against non-relevant documents
that may enter the top retrieved set.
The presence of informative keywords is another important factor influencing the performance of the system. For instance,
query 7, which is "Commemorations of Sadi Shirazi", has a precision of 0.93 at a document cut-off of 15 (P@15). The term "Sadi"
(the Iranian poet) conveys more information than the terms "Commemorations" and "Shirazi" (a reference to a city in Iran). As
a result, Table 4 shows an improved MAP of up to 55.79%. On the other hand, query 21, which is "Television and Mental Health",
has a precision of 0 at the same document cut-off. Because there is no clear informative keyword in this query to reveal the
intention of the user, the MAP value of this query is affected dramatically.
Table 6
The optimal parameters of DS weighting in the data sets.
Data set     N   K    L
FIRE         20  0.9  4
Hamshahri 1  70  0.9  3
Hamshahri 2  20  0.9  4.4
Table 7
Evaluation results for DS weighting using query descriptions. α and β indicate
statistically significant improvements over language modeling and WIG weighting, respectively.
Data set     Model              MAP       P@10     R-precision
FIRE         Language modeling  0.3104    0.466    0.333
FIRE         WIG weighting      0.317     0.452    0.3484
FIRE         DS weighting       0.3639αβ  0.504αβ  0.3869αβ
Hamshahri 1  Language modeling  0.285     0.5      0.3310
Hamshahri 1  WIG weighting      0.302     0.51     0.3452
Hamshahri 1  DS weighting       0.3544αβ  0.58αβ   0.3847αβ
Hamshahri 2  Language modeling  0.2846    0.538    0.3226
Hamshahri 2  WIG weighting      0.3011    0.554    0.334
Hamshahri 2  DS weighting       0.365αβ   0.592αβ  0.3954αβ
Table 6 shows the optimal values of the three parameters N, K, and L in the data sets. The values for the FIRE and Hamshahri 2
data sets are the averages of the corresponding parameters over the folds of the cross-validation process. The parameters in
the folds were mostly similar; thus, to avoid reporting repeated values, Table 6 only shows the averages. We can observe
that there is no fixed value for parameter N (the number of top documents); it varies from one data set to another. On the other
hand, the optimal value of parameter K (the coefficient of a document's similarity to the other top documents) tends to favor the
documents that have a higher similarity to the other top documents over those that are more similar to the query: the higher
the value of parameter K, the more influential the value of Eq. (4) becomes in the final weightings. We predict this
behavior may change in a web environment. In real-world situations, due to the presence of spam documents in the top
list, overweighting top documents may cause query drift.
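Eq. (4) itself appears earlier in the paper and is not reproduced in this excerpt. The sketch below therefore assumes a simple linear interpolation between a document's similarity to the query and its average similarity to the other top documents; the function names and the exact interpolation form are illustrative assumptions, chosen only to show how a larger K shifts the document weight toward inter-document similarity:

```python
import math
from typing import Dict, List

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def document_weight(doc: Dict[str, float], query: Dict[str, float],
                    other_docs: List[Dict[str, float]], k: float) -> float:
    # Assumed interpolation: k = 0 trusts only similarity to the query,
    # k = 1 trusts only the average similarity to the other top documents.
    inter = sum(cosine(doc, d) for d in other_docs) / len(other_docs)
    return (1.0 - k) * cosine(doc, query) + k * inter
```

Under such an interpolation, the optimal K = 0.9 reported in Table 6 would indeed make the inter-document term dominant, consistent with the observation above.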
Regarding the execution time of our method, since we run the query against the data set twice (once for retrieving the
top documents and once for the final results), our method is slower than the baseline (language modeling). However, considering
that we re-formulate the query by re-weighting its original terms, our method is faster than expansion methods that
add new terms to the query, because adding new terms usually reduces retrieval speed.
We also conducted another experiment to measure the effectiveness of our method for longer queries. In the data sets, we
used the descriptions of the queries instead of their titles. The average lengths of the query descriptions
in the FIRE, Hamshahri 1, and Hamshahri 2 data sets are 7.76, 6.67, and 6.46 terms, respectively. Table 7 reports the results of this
experiment. The results indicate that, on average, the performance of the long queries is lower than that of their shorter equivalents in
the Hamshahri 1 and Hamshahri 2 data sets. This is due to the presence of terms that are not directly related
to the users' information need. Our method improves performance by up to 24.35% and 28.25% in terms of MAP over language
modeling in these data sets. On the other hand, the results on the FIRE data set show that the performance of the long queries is
higher than that of the shorter ones. Although these results suggest that the terms used in the query descriptions are accurate,
our method still manages to improve performance by up to 17.24% in terms of MAP over language modeling. That is because
our method was able to correctly detect the more informative keywords from among all the keywords in the queries.
5. Conclusions and future work
In this paper, we proposed a straightforward approach to query term re-weighting. Our approach uses the initial query to first
retrieve a set of documents; it then weights each document based on its closeness to the user's information need. These weights
are used in the recalculation of the query term weights. Our approach improves retrieval performance, in terms of the MAP criterion,
by up to 7% over the language modeling approach on three data sets. It also outperforms other query term re-weighting approaches, such as
the WIG term weighting model. We believe more sophisticated weighting methods can help achieve even further improvements;
therefore, in future work we plan to investigate various probabilistic frameworks to achieve better results.
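The recalculation summarized above can be sketched as follows. The exact formulas are given in the body of the paper and are not reproduced in this excerpt; this sketch assumes that each query term's new weight is the document-weight-weighted sum of its term frequencies in the top documents, normalized so that the strongest term receives weight 1 (matching the scale of the weights shown in Table 5). It is an illustration, not the paper's exact method:

```python
from collections import Counter
from typing import Dict, List

def reweight_query(query_terms: List[str],
                   top_docs: List[List[str]],
                   doc_weights: List[float]) -> Dict[str, float]:
    """Recompute query term weights from the weighted top documents."""
    scores = {t: 0.0 for t in query_terms}
    for doc, weight in zip(top_docs, doc_weights):
        tf = Counter(doc)  # term frequencies in this document
        for t in query_terms:
            scores[t] += weight * tf[t]
    peak = max(scores.values())
    # Normalize so the most informative query term gets weight 1.
    return {t: (s / peak if peak else 0.0) for t, s in scores.items()}
```

The re-weighted terms would then be submitted in a second retrieval cycle, which is why the method runs each query twice.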
References
AleAhmad, Abolfazl, Amiri, Hadi, Darrudi, Ehsan, Rahgozar, Masoud, & Oroumchian, Farhad (2009). Hamshahri: A standard Persian text collection. Knowledge-Based Systems, 22(5), 382–387.
Aleahmad, Abolfazl, Hakimian, Parsia, Mahdikhani, Farzad, & Oroumchian, Farhad (2007). N-gram and local context analysis for Persian text retrieval. In Proceedings of the 9th international symposium on signal processing and its applications, ISSPA 2007. IEEE.
Allan, James, Connell, Margaret E., Croft, W. Bruce, Feng, Fang-Fang, Fisher, David, & Li, Xiaoyan (2000). INQUERY and TREC-9. DTIC Document.
Apache Software Foundation. TF-IDF similarity (Lucene 4.8.1 API). Available from: http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html. Accessed 10.08.15.
Bendersky, Michael, & Croft, W. Bruce (2008). Discovering key concepts in verbose queries. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Craswell, Nick, Robertson, Stephen, Zaragoza, Hugo, & Taylor, Michael (2005). Relevance weighting for query independent evidence. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Croft, W. Bruce, Cronen-Townsend, Stephen, & Lavrenko, Victor (2001). Relevance feedback and personalization: A language modeling perspective. In Proceedings of the DELOS workshop: Personalisation and recommender systems in digital libraries.
Hakimian, Parsia, & Taghiyareh, Fattaneh (2008). Customizing local context analysis for Farsi information retrieval by using a new concept weighting algorithm. In Proceedings of the third international workshop on semantic media adaptation and personalization, SMAP'08. IEEE.
Kang, In-Ho, & Kim, GilChang (2003). Query type classification for web document retrieval. In Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Lavrenko, Victor, & Croft, W. Bruce (2001). Relevance based language models. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Lee, Kyung Soon, Croft, W. Bruce, & Allan, James (2008). A cluster-based resampling method for pseudo-relevance feedback. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Liu, Ziyang, Natarajan, Sivaramakrishnan, & Chen, Yi (2011). Query expansion based on clustered results. Proceedings of the VLDB Endowment, 4(6), 350–361.
Majumder, Prasenjit, Mitra, Mandar, Pal, Dipasree, Bandyopadhyay, Ayan, Maiti, Samaresh, Pal, Sukomal, Modak, Deboshree, & Sanyal, Sucharita (2010). The FIRE 2008 evaluation exercise. ACM Transactions on Asian Language Information Processing (TALIP), 9(3), 10.
Manning, Christopher D., Raghavan, Prabhakar, & Schütze, Hinrich (2008). Introduction to information retrieval: Vol. 1. Cambridge: Cambridge University Press.
Robertson, Stephen E., & Jones, K. Sparck (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129–146.
Saboori, F., Bashiri, H., & Oroumchian, Farhad (2012). Assessment of query reweighing by Rocchio method in Farsi information retrieval. International Journal of Information Science and Management (IJISM), 6(1), 9–16.
Salton, Gerard, Wong, Anita, & Yang, Chung-Shu (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
Sieg, Ahu, Mobasher, Bamshad, & Burke, Robin (2007). Web search personalization with ontological user profiles. In Proceedings of the sixteenth ACM conference on information and knowledge management. ACM.
Xu, Jinxi, & Croft, W. Bruce (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems (TOIS), 18(1), 79–112.
Zaragoza, Hugo, Craswell, Nick, Taylor, Michael J., Saria, Suchi, & Robertson, Stephen E. (2004). Microsoft Cambridge at TREC 13: Web and hard tracks. In Proceedings of the text retrieval conference, TREC.
Zhai, Chengxiang, & Lafferty, John (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Zhou, Yun, & Croft, W. Bruce (2007). Query performance prediction in web search environments. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM.
Please cite this article as: P. Karisani et al., A query term re-weighting approach using document similarity, Information Pro-
cessing and Management (2015), http://dx.doi.org/10.1016/j.ipm.2015.09.002
... Since the retrieval performance depends on the term matching score of query and document, the search engines need to focus on the central terms and reduce erroneous matching. For decades, several supervised and unsupervised query term weighting approaches (Paik and Oard 2014;Zheng and Callan 2015;Karisani, Rahgozar, and Oroumchian 2016) are proposed to address this problem, but these methods lag behind in capturing the intrinsic meaning of the query. Although several semantic neural retrieval models are proposed in recent times, these models need well-tuned search space for efficient performance (Hammache and Boughanem 2021), especially when the datasets are large, and in these cases, the traditional term-based IR models help. ...
... The retrieval effectiveness of the proposed BERT-based Attentive Sequential Dependency Model (B-ASD) for the description query topics is measured using recall and utility-based metrics (such as Normalized Discounted Cumulative Gain (NDCG) and Expected Reciprocal Rank (ERR) (Paik et al. 2021), which is shown in Table 1. This table also reflects the comparison of the proposed model with the basic sequential dependence model (Metzler and Croft 2005) and the two query term weighting methods, i.e., (i) term re-weighting with document similarity (DSW) (Karisani, Rahgozar, and Oroumchian 2016) and (ii) term re-weighting with distributed representations (deepTR) (Zheng and Callan 2015). ...
Article
The query-document term matching plays an important role in information retrieval. However, the retrieval performance degrades when the documents get matched with the extraneous terms of the query which frequently arises in verbose queries. To address this problem, we generate the dense vector of the entire query and individual query terms using the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model and subsequently analyze their relation to focus on the central terms. We then propose a context-aware attentive extension of unsupervised Markov Random Field-based sequential term dependence model that explicitly pays more attention to those contextually central terms. The proposed model utilizes the strengths of the pre-trained large language model for estimating the attention weight of terms and rank the documents in a single pass without any supervision.
... The performed evaluations demonstrated the effectiveness with respect to the baseline. Also, the work presented by Karisani et al. (2016) has proposed a method to identify and re-weight informative query terms, by examining the similarity of top documents and weighting them based on their context. The analysis of results obtained indicate that the suggested method is capable of identifying the most important keywords even in short queries and it improves retrieval performance around of 7% of MAP over traditional query term re-weighting methods. ...
... They showed an increasing of performance around 13% of MAP, but they only tested on set of short keyword queries and sub-corpora. Second, Karisani et al., (2016) proposed a method, in a local analysis, for identifying and re-weighting informative query terms, that improves retrieval performance around of 7% of MAP over traditional query term re-weighting methods. This proposed work give a solution to represent the keywords as a semantic hierarchy of the query, thus the user's need. ...
Article
Full-text available
In this paper, the authors propose and readapt a new concept-based approach of query expansion in the context of Arabic information retrieval. The purpose is to represent the query by a set of weighted concepts in order to identify better the user's information need. Firstly, concepts are extracted from the initially retrieved documents by the Pseudo-Relevance Feedback method, and then they are integrated into a semantic weighted tree in order to detect more information contained in the related concepts connected by semantic relations to the primary concepts. The authors use the “Arabic WordNet” as a resource to extract, disambiguate concepts and build the semantic tree. Experimental results demonstrate that measure of MAP (Mean Average Precision) is about 10% of improvement using the open source Lucene as IR System on a collection formed from the Arabic BBC news.
... Once the expansion terms are selected, a 'term re-weighting' technique is generally used to assign a meaningful weight to each query term according to its importance [12,16,17]. Various works are based on Rocchio's [18] hypothesis that states that the original query terms should be weighted higher than the expansion query terms. ...
Article
Full-text available
Query reformulation is a well-known technique intended to improve the performance of Information Retrieval Systems. Among the several available techniques, Query Expansion (QE) reformulates the initial query by adding similar terms, drawn from several sources (corpus, knowledge resources), to the query terms in order to retrieve more relevant documents. Most QE methods are based on the relationships between the original query term and candidate terms (new terms) in order to select the most similar expansion terms. In this paper, we suggested a new hybrid query reformulation through QE and term re-weighting techniques. The suggested approach aimed to demonstrate the effectiveness of QE with a semantic selection of candidate terms according to the specificity of original query terms in the improvement of retrieval performance. To this end, we exploited both relationships defined by knowledge resources and the distributed semantics, recently revealed by neural network analysis. For term re-weighting, we proposed a new semantic method based on semantic similarity measure that assigns a weight to each term of the expanded query. The conducted experiments on OHSUMED and TREC 2014 CDS test collections, including long and short queries, yielded significant results that outperformed the baseline and state-of-the-art approaches.
... There are several methods for deciding each matrix element's value. One of the known schemes is TF-IDF [52]. The relevance of a word within a document is measured by TF, whereas the global significance of a term within a dataset is measured by DF [53]. ...
Article
Full-text available
We propose a novel text classification model, which aims to improve the performance of Arabic text classification using machine learning techniques. One of the effective solutions in Arabic text classification is to find the suitable feature selection method with an optimal number of features alongside the classifier. Although several text classification methods have been proposed for the Arabic language using different techniques, such as feature selection methods, an ensemble of classifiers, and discriminative features, choosing the optimal method becomes an NP-hard problem considering the huge search space. Therefore, we propose a method, called Optimal Configuration Determination for Arabic text Classification (OCATC), which utilized the Particle Swarm Optimization (PSO) algorithm to find the optimal solution (configuration) from this space. The proposed OCATC method extracts and converts the features from the textual documents into a numerical vector using the Term Frequency-Inverse Document Frequency (TF–IDF) approach. Finally, the PSO selects the best architecture from a set of classifiers to feature selection methods with an optimal number of features. Extensive experiments were carried out to evaluate the performance of the OCATC method using six datasets, including five publicly available datasets and our proposed dataset. The results obtained demonstrate the superiority of OCATC over individual classifiers and other state-of-the-art methods.
... A wellknown method for relevance feedback is Rocchio's (1971) which is based on the vector space model, and another primary study is that of Croft and Harper (1979) which is a probabilistic approach. Karisani et al. (2016) proposed a method to extract the most informative terms in a set of documents for PRF. A set of documents is retrieved using the user's initial query and then a weight is assigned to each document describing the document's closeness to the user's information need. ...
Article
Full-text available
We propose a method for automatic optimization of pseudo relevance feedback (PRF) in information retrieval. Based on the conjecture that the initial query’s contribution to the final query may not be necessary once a good model is built from pseudo relevant documents, we set out to optimize per query only the number of top-retrieved documents to be used for feedback. The optimization is based on several query performance predictors for the initial query, by building a linear regression model discovering the optimal machine learning pipeline via genetic programming. Even by using only 50–100 training queries, the method yields statistically-significant improvements in MAP of 18–35% over the initial query, 7–11% over the feedback model with the best fixed number of pseudo-relevant documents, and up to 10% (5.5% on median) over the standard method of optimizing both the balance coefficient and the number of feedback documents by grid-search in the training set. Compared to state-of-the-art PRF methods from the recent literature, our method outperforms by up to 21% with an average of 10%. Further analysis shows that we are still far from the method’s effectiveness ceiling (in contrast to the standard method), leaving amble room for further improvements.
... Pseudo-relevance feedback (PRF) is a method that used in a branch of automatic query modification technique. PRF assumes that the initial retrieved documents are relevant and then it uses these documents to find more relevant terms to the query or it just re-weighs the original query terms (Karisani et al., 2016). Word embeddings (WE) is a common name to a set of techniques to model languages and extract interested features. ...
Article
Full-text available
The main goal of information retrieval is getting the most relevant documents to a user’s query. So, a search engine must not only understand the meaning of each keyword in the query but also their relative senses in the context of the query. Discovering the query meaning is a comprehensive and evolutionary process; the precise meaning of the query is established as developing the association between concepts. The meaning determination process is modeled by a dynamic system operating in the semantic space of WordNet. To capture the meaning of a user query, the original query is reformulating into candidate queries by combining the concepts and their synonyms. A semantic score characterizing the overall meaning of such queries is calculated, the one with the highest score was used to perform the search. The results confirm that the proposed "Query Sense Discovery" approach provides a significant improvement in several performance measures.
... Azad and Deepak (2019) have provided a comprehensive survey of QE techniques, including various topics including weighting and ranking methodologies. Relevance feedback proposed in Karisani et al. (2016) assumes that the set of initial retrieved documents are relevant, then shows them to the user and, according to the user's choices, tries to identify more relevant terms and improve query by reweighting the user's original query. Modeling a text as a graph is another method used in IR and text classification. ...
Article
Full-text available
It has become evident that term weighting has a significant effect on relevant document retrieval for which various methods are proposed. However, the main question that arises is which weighting method is the best? In this paper, it is shown that proper aggregation of weights generated by carefully selected basic weighting methods improves retrieval of the relevant documents with respect to the user’s needs. Toward this aim, it is shown that even using simple central tendency measures such as average, median or mid-range over an appropriate subset of basic weighting methods provides term weight that not only outperforms using each basic weighting method but also results in more effective weights in comparison with recently proposed complicated weighting methods. Based on exploiting the proposed method on various datasets, we have studied the effects of normalization of the basic weights, normalization of the vector lengths, the use of different components in the term frequency factor, etc. Results reveal the criteria for selecting an appropriate subset of basic weighting methods that would be fed to the aggregator in order to achieve higher retrieval precision.
Article
The rapid growth of contents on the Web in different languages increases the demand of Cross-Lingual Information Retrieval (CLIR). The accuracy of result suffers due to many problems such as ambiguity and drift issue in query. Query Expansion (QE) offers reliable solution for obtaining suitable documents for user queries. In this paper, we proposed an architecture for Hindi–English CLIR system using QE for improving the relevancy of retrieved results. In this architecture, for the addition of term(s) at appropriate position(s), we proposed a location-based algorithm to resolve the drift query issue in QE. User queries in Hindi language have been translated into document language (i.e. English) and the accuracy of translation is improved using Back-Translation. Google search has been performed and the retrieved documents are ranked using Okapi BM25 to arrange the documents in the order of decreasing relevancy to select the most suitable terms for QE. We used term selection value (TSV) for QE and for retrieving the terms, we created three test collections namely the (i) description and narration of the Forum for Information Retrieval Evaluation (FIRE) dataset, (ii) Snippets of retrieved documents against each query and (iii) Nearest-Neighborhood (NN) words against each query word among the ranked documents. To evaluate the system, 50 queries of Hindi language are selected from the FIRE-2012 dataset. In this paper, we performed two experiments: (i) impact of the proposed location-based algorithm on the proposed architecture of CLIR; and (ii) analysis of QE using three datasets, i.e. FIRE, NN and Snippets. In the first case, result shows that the relevancy of Hindi–English CLIR is improved by performing QE using the location-based algorithm and a 12% of improvement is achieved as compared to the results of QE obtained without applying the location-based algorithm. In the second case, the location-based algorithm is applied on three datasets. 
The Mean Average Precision (MAP) values of retrieved documents after QE are 0.5379 (NN), 0.6018 (FIRE) and 0.6406 (Snippets) for the three test collections, whereas the MAP before QE is 0.37102. This clearly shows the significant improvement of retrieved results for all three test collections. Among the three test collections, QE has been found most effective along with Snippets as indicated by the results with the improvements of 6.48% and 19.12% over FIRE and NN test collections, respectively.
Article
In this article, a pseudo-relevance feedback (PRF)–based framework is presented for effective query expansion (QE). As candidate expansion terms, the proposed PRF framework considers the terms that are different morphological variants of the original query terms and are semantically close to them. This strategy of selecting expansion terms is expected to preserve the query intent after expansion. While judging the suitability of an expansion term with respect to a base query, two aspects of relation of the term with the query are considered. The first aspect probes to what extent the candidate term is semantically linked to the original query and the second one checks the extent to which the candidate term can supplement the base query terms. The semantic relationship between a query and expansion terms is modelled using bidirectional encoder representations from transformers (BERT). The degree of similarity is used to estimate the relative importance of the expansion terms with respect to the query. The quantified relative importance is used to assign weights of the expansion terms in the final query. Finally, the expansion terms are grouped into semantic clusters to strengthen the original query intent. A set of experiments was performed on three different Text REtrieval Conference (TREC) collections to experimentally validate the effectiveness of the proposed QE algorithm. The results show that the proposed QE approach yields competitive retrieval effectiveness over the existing state-of-the-art PRF methods in terms of the mean average precision (MAP) and precision P at position 10 (P@10).
Article
The importance of query performance prediction has been widely acknowledged in the literature, especially for query expansion, refinement, and interpolating different retrieval approaches. This paper proposes a novel semantics-based query performance prediction approach based on estimating semantic similarities between queries and documents. We introduce three post-retrieval predictors, namely (1) semantic distinction, (2) semantic query drift, and (3) semantic cohesion based on (1) the semantic similarity of a query to the top-ranked documents compared to the whole collection, (2) the estimation of non-query related aspects of the retrieved documents using semantic measures, and (3) the semantic cohesion of the retrieved documents. We assume that queries and documents are modeled as sets of entities from a knowledge graph, e.g., DBPedia concepts, instead of bags of words. With this assumption, semantic similarities between two texts are measured based on the relatedness between entities, which are learned from the contextual information represented in the knowledge graph. We empirically illustrate these predictors’ effectiveness, especially when term-based measures fail to quantify query performance prediction hypotheses correctly. We report our findings on the proposed predictors’ performance and their interpolation on three standard collections, namely ClueWeb09-B, ClueWeb12-B, and Robust04. We show that the proposed predictors are effective across different datasets in terms of Pearson and Kendall correlation coefficients between the predicted performance and the average precision measured by relevance judgments.
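The semantic-cohesion predictor described above can be sketched as mean pairwise relatedness over the entity sets of the top-ranked documents. This is a simplified stand-in: the paper uses graph-based relatedness of knowledge-graph entities, whereas here plain Jaccard overlap is substituted, and the DBpedia-style entity names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Stand-in for the graph-based entity relatedness used in the paper."""
    return len(a & b) / len(a | b) if a | b else 0.0

def semantic_cohesion(doc_entities):
    """Mean pairwise relatedness of the entity sets of the top-ranked docs."""
    pairs = list(combinations(doc_entities, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical entity annotations for three retrieved documents.
docs = [{"dbr:Jaguar", "dbr:Car"},
        {"dbr:Jaguar", "dbr:Cat"},
        {"dbr:Car", "dbr:Engine"}]
score = semantic_cohesion(docs)
```

A high score suggests the retrieved set revolves around related entities, which the predictor takes as evidence of a well-performing query.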
Article
This paper examines statistical techniques for exploiting relevance information to weight search terms. These techniques are presented as a natural extension of weighting methods using information about the distribution of index terms in documents in general. A series of relevance weighting functions is derived and is justified by theoretical considerations. In particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval. Different applications of relevance weighting are illustrated by experimental results for test collections.
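The relevance weight derived in this line of work is usually written in its point-estimate form with a 0.5 correction; a minimal sketch (the variable names follow the standard presentation of this weighting scheme, not text from the abstract):

```python
import math

def relevance_weight(N, n, R, r):
    """Relevance weight of a search term with the usual 0.5 correction:
    N  docs in the collection,       n  docs containing the term,
    R  known relevant docs,          r  relevant docs containing the term."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((R - r + 0.5) * (n - r + 0.5)))

# A term concentrated in the relevant documents gets a positive weight.
w = relevance_weight(N=1000, n=50, R=10, r=8)
```

Terms that occur mostly outside the relevant set receive negative weights, so the formula both promotes and demotes search terms.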
Article
This year the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts participated in three of the tracks: the cross-language, question answering, and query tracks. We used approaches that were similar to those used in past years. Although UMass used a wide range of tools, from Unix shell scripts, to PC spreadsheets, three major tools and techniques were applied across almost all tracks: the Inquery search engine, query processing, and a query expansion technique known as LCA. All three tracks used Inquery as the search engine, sometimes for training, and always for generating the final ranked lists for the test. In the cross language track, we experimented with some techniques for crossing the character encoding boundaries. Our efforts were moderately successful, but we do not believe that our approach worked well in comparison to other techniques. In the question answering track, we focused on bringing answer-containing documents to the top of the ranked list. This is an important sub-task for most methods of tackling Q&A, and we are pleased with our results. We are now looking at alternate ways of thinking about that task that leverage the differences between retrieval for Q&A and for IR. Finally, we continued to participate in the query track, providing large numbers of query variants, and running our system on the huge number of resulting queries. Our analysis showed how query expansion compensates for some of the problems that can occur in query formulation.
Article
The Persian language is one of the dominant languages in the Middle East, so a significant number of Persian documents are available on the Web. Due to the different nature of the Persian language compared to other languages such as English, the design of information retrieval systems in Persian requires special considerations. However, there are relatively few studies on retrieval of Persian documents in the literature, and one of the main reasons is the lack of a standard test collection. In this paper, we introduce a standard Persian text collection, named Hamshahri, which is built from a large number of newspaper articles according to TREC specifications. Furthermore, statistical information about documents, queries and their relevance judgments is presented in this paper. We believe that this collection is the largest Persian text collection so far.
Conference Paper
Every user has a distinct background and a specific goal when searching for information on the Web. The goal of Web search personalization is to tailor search results to a particular user based on that user's interests and preferences. Effective personalization of information access involves two important challenges: accurately identifying the user context and organizing the information in such a way that matches the particular context. We present an approach to personalized search that involves building models of user context as ontological profiles by assigning implicitly derived interest scores to existing concepts in a domain ontology. A spreading activation algorithm is used to maintain the interest scores based on the user's ongoing behavior. Our experiments show that re-ranking the search results based on the interest scores and the semantic evidence in an ontological user profile is effective in presenting the most relevant results to the user.
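The spreading-activation step described above can be sketched over a toy profile. This is a minimal illustration under assumed data: the ontology, concept names, edge weights, and decay factor are all hypothetical, not taken from the paper.

```python
# Hypothetical domain ontology: concept -> [(neighbour, edge weight), ...].
ontology = {
    "Sports":   [("Football", 0.8), ("Tennis", 0.6)],
    "Football": [("Sports", 0.8)],
    "Tennis":   [("Sports", 0.6)],
}

def spread(interest, decay=0.5, iterations=1):
    """Propagate interest scores to neighbouring ontology concepts,
    attenuated by the edge weight and a decay factor."""
    scores = dict(interest)
    for _ in range(iterations):
        updates = {}
        for concept, score in scores.items():
            for neighbour, weight in ontology.get(concept, []):
                updates[neighbour] = updates.get(neighbour, 0.0) + decay * weight * score
        for concept, delta in updates.items():
            scores[concept] = scores.get(concept, 0.0) + delta
    return scores

# Implicit evidence of interest in "Football" also raises "Sports".
profile = spread({"Football": 1.0})
```

The resulting scores could then be used to re-rank search results, as the abstract describes.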
Conference Paper
A query independent feature, relating perhaps to document content, linkage or usage, can be transformed into a static, per-document relevance weight for use in ranking. The challenge is to find a good function to transform feature values into relevance scores. This paper presents FLOE, a simple density analysis method for modelling the shape of the transformation required, based on training data and without assuming independence between feature and baseline. For a new query independent feature, it addresses the questions: is it required for ranking, what sort of transformation is appropriate and, after adding it, how successful was the chosen transformation? Based on this we apply sigmoid transformations to PageRank, indegree, URL Length and ClickDistance, tested in combination with a BM25 baseline.
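A saturating transformation of the kind described above can be sketched as follows; the parameter values `w`, `k`, and `a` here are illustrative placeholders, not fitted values from the paper (FLOE fits the shape from training data).

```python
def static_score(x, w=2.0, k=10.0, a=1.5):
    """Saturating sigmoid transform of a non-negative query-independent
    feature value x (e.g. indegree or PageRank) into a bounded,
    per-document static relevance score."""
    return w * x ** a / (k ** a + x ** a)

def combined(bm25_score, feature_value):
    """Add the transformed static score to a BM25 baseline score."""
    return bm25_score + static_score(feature_value)
```

The transform is monotone increasing in `x` but bounded above by `w`, so extreme feature values cannot overwhelm the query-dependent BM25 score; for features where smaller is better (e.g. URL length, ClickDistance) a decreasing variant would be used instead.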
Article
A lot of digital Farsi content has been produced recently in the Middle East. Local Context Analysis (LCA) is an automated query expansion method that adds concepts to the original query based on an initial retrieval using the original query. In our previous works we attempted to tune this method for the Farsi language by manipulating three parameters: the number of concepts used for query expansion, the number of initially retrieved documents for local feedback, and the number of passages for concept discovery and weighting. In this paper we seek to further customize this method for Farsi information retrieval. To compare our work to previous attempts we have used the Hamshahri collection and 60 queries. We have experimented with different numbers of concepts and have also changed the concept weighting algorithm to improve retrieval performance.
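The concept-weighting idea behind LCA can be sketched in a heavily simplified form: a candidate concept is rewarded for co-occurring with every query term in the top-retrieved passages. This is not the full LCA formula (the original also applies idf weighting and a logarithmic damping, both omitted here), and the passages and terms are hypothetical.

```python
def lca_score(concept, query_terms, passages, delta=0.1):
    """Simplified LCA-style score: the product over query terms of
    (delta + co-occurrence count of the concept with that term),
    so a concept must co-occur with *all* query terms to score well."""
    score = 1.0
    for term in query_terms:
        co = sum(1 for p in passages if concept in p and term in p)
        score *= delta + co
    return score

# Hypothetical top-retrieved passages, each reduced to a set of terms.
passages = [{"tehran", "newspaper", "archive"},
            {"tehran", "newspaper", "press"},
            {"archive", "press"}]
s = lca_score("newspaper", ["tehran", "press"], passages)
```

The multiplicative form means a concept that never co-occurs with some query term is left with only the small `delta` contribution for that term, pushing it down the ranking of candidate expansion concepts.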
Conference Paper
Current prediction techniques, which are generally designed for content-based queries and are typically evaluated on relatively homogeneous test collections of small sizes, face serious challenges in web search environments where collections are significantly more heterogeneous and different types of retrieval tasks exist. In this paper, we present three techniques to address these challenges. We focus on performance prediction for two types of queries in web search environments: content-based and Named-Page finding. Our evaluation is mainly performed on the GOV2 collection. In addition to evaluating our models for the two types of queries separately, we consider a more challenging and realistic situation in which the two types of queries are mixed together without prior information on query types. To assist prediction under the mixed-query situation, a novel query classifier is adopted. Results show that our prediction of web query performance is substantially more accurate than the current state-of-the-art prediction techniques. Consequently, our paper provides a practical approach to performance prediction in real-world web settings.
Conference Paper
We explore the relation between classical probabilistic models of information retrieval and the emerging language modeling approaches. It has long been recognized that the primary obstacle to effective performance of classical models is the need to estimate a relevance model: probabilities of words in the relevant class. We propose a novel technique for estimating these probabilities using the query alone. We demonstrate that our technique can produce highly accurate relevance models, addressing important notions of synonymy and polysemy. Our experiments show relevance models outperforming baseline language modeling systems on TREC retrieval and TDT tracking tasks. The main contribution of this work is an effective formal method for estimating a relevance model with no training data.
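The estimation idea described above can be sketched in a minimal RM1-style form: the relevance model is approximated by averaging document language models, weighted by each document's likelihood of generating the query. This sketch uses unsmoothed maximum-likelihood estimates and toy documents, so a query term absent from a document zeroes that document's contribution; a real implementation would smooth the document models.

```python
def relevance_model(query, docs):
    """RM1-style estimate: P(w|R) ~ sum_D P(w|D) * P(Q|D), normalised.
    Each doc is a list of tokens; P(w|D) is the unsmoothed unigram MLE."""
    def p_w_d(w, doc):
        return doc.count(w) / len(doc)

    def p_q_d(doc):
        p = 1.0
        for q in query:
            p *= p_w_d(q, doc)
        return p

    vocab = {w for doc in docs for w in doc}
    raw = {w: sum(p_w_d(w, d) * p_q_d(d) for d in docs) for w in vocab}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()} if total else raw

# Toy collection: related words ("car", "cat") pick up probability mass
# from documents that generate the query well.
docs = [["jaguar", "car", "speed", "car"],
        ["jaguar", "cat", "jungle", "cat"]]
rm = relevance_model(["jaguar"], docs)
```

Words that co-occur with the query inside high-likelihood documents receive non-zero relevance-model probability even though they never appear in the query itself, which is how the approach captures synonymy-like effects.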