A Distributed N-Gram Indexing System to Optimizing Persian Information Retrieval

Authors: Mohadese Danesh, Behrouz Minaei, and Omid Kashefi

International Journal of Computer Theory and Engineering, Vol. 5, No. 2, April 2013. DOI: 10.7763/IJCTE.2013.V5.681
Abstract: As the amount of information and the number of queries keep increasing, indexing is a good solution for fighting the inherent complexity of text retrieval and for accelerating information retrieval in different languages. N-gram indexing, in particular, addresses issues such as stemming, misspellings, multilingual text, and partial matching, and it has the advantages of language independence and error tolerance. Persian is a language common in the Middle East, spoken in countries such as Iran, Afghanistan, and Tajikistan; it is therefore the language of many documents published on the net, yet little research has been done on Persian document retrieval. In this paper, we present a method for Persian document retrieval that uses N-gram indexing and a distribution technique. The proposed index answers queries more effectively, increases the quality of information retrieval substantially, and yields better retrieval of Persian documents. Because the speed of N-gram indexing is low, we also design a distributed N-gram indexing mechanism for large Persian-language systems. Compared with the other methods in this field, we improve both the quality of the retrieved documents and the speed of information retrieval.
Index Terms: Information retrieval, indexing, N-gram, distributed, Persian.
I. INTRODUCTION
The process of information retrieval (IR) has become especially important as a result of the growing amount of information stored in available sources [1]. IR aims to help users find the desired information among large amounts of unstructured data. There are different models of IR, such as Boolean retrieval, linguistic and knowledge-based approaches, inference networks, and statistical models, which include the vector space model (VSM), the probabilistic model, latent semantic indexing (LSI), and document clustering [2]. Controlling and locating the best and most relevant documents for a specific information need is the biggest challenge of the present time [1]. The reasons for using IR systems can be summarized in three points [3]: (1) faster processing of large volumes of documents, (2) comparison between documents, and (3) organized retrieval.
Indexing is one of the most widely accepted methods of retrieving information from the contents of documents and is known as the most efficient route to rapid and accurate retrieval of information [4], [5]. In this paper, considering the growing volume of Persian documents in web pages, we propose a distributed N-gram based indexing method to improve IR quality. The proposed distributed architecture brings scalability for the multitude and ever-growing amount of information and copes with the large volume of retrieval requests.

(Manuscript received September 5, 2012; revised November 18, 2012. This work was funded by the Computer Research Center of Islamic Science (CRCIS). The authors are with the School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran; e-mail: mddanesh@comp.iust.ac.ir, b_minaei@iust.ac.ir, kashefi@{iust.ac.ir, ieee.org}.)
The amount of information in the Persian language on the internet has increased in different forms. As the style of writing in Persian is not firmly standardized, many web pages in Persian write the same words and phrases in completely different styles [6]. The most common writing challenges in Persian that affect IR are as follows [7]-[9]:
1) Informal or colloquial writing. Many words are commonly used in their colloquial rather than their formal forms. Colloquial writing also majorly affects the syntax of sentences by omitting prepositions and substantially changing the order of words.
2) Use of foreign words. In Persian web pages, especially scientific ones, many words are borrowed from other languages, some of which are rendered in Persian script.
3) Complex inflection. Persian includes more than 2,800 declensional suffixes. Whether an affix attaches to a word closely, detached, or with a pseudo-space is another challenging issue in Persian; the word for "vase", for instance, is commonly, though officially incorrectly, spelled in several different ways. In some cases affixes, and especially suffixes, change depending on the lemma: the plural of the word for "bird", for example, deforms the lemma itself. These morphophonemic rules also affect how affixes combine with one another; they depend on the phonetic attributes of words and so are hard to implement computationally. For instance, the first-person singular possessive suffix inflects differently depending on the word it attaches to, and it takes different forms in combination with different plural suffixes.
4) Multiple spellings of a word. Some Persian words have more than one correct spelling. For example, two different spellings of the word for "room" are both correct and are pronounced identically.
5) Homographs. The same written form may have different meanings, because short vowels are not written in Persian. For example, the words meaning "man" and "died" are written identically.
6) Word spacing. In Persian, in addition to white space as an inter-word separator, an intra-word space called the pseudo-space (the Zero Width Non-Joiner in the Unicode character encoding) delimits a word's parts. Using white space, or no space at all, instead of the pseudo-space is a great challenge in Persian writing. The lack of a comprehensive set of writing rules results in native Persian writers producing the same words and phrases in multiple forms. For example, with one spacing a phrase is a complete, correct, and meaningful sentence meaning "Cool the water", while with another spacing it is a single word meaning "water cooler".
In this paper, we study the behavior of lemmatized N-grams in retrieving Persian information, as the N-gram indexing method reacts well to spelling errors and variations and overlooks morphemic changes. We use document weighting to index and to rank the retrieved documents. Since retrieval speed is very important in indexing, we use distributed processing to cope with the large amount of retrieved information and the low speed of N-gram indexing. Results show that using a combination of unigrams and 3-grams improves the retrieval of Persian documents.
We used the Hamshahri corpus as our test bed, which consists of news articles of the Hamshahri newspaper from 1996 to 2007. This collection contains over 160,000 articles on different topics. The sizes of articles vary from short news items (less than 1 KB) to very long articles (140 KB), with a mean of 1.8 KB. Emphatic signs, punctuation, stop words, separators, and numbers were all omitted during preprocessing of the corpus. Preprocessing basically consists of optimizing the list of terms that identify the corpus, prior to query processing. This optimization can focus on reducing the number of terms by eliminating those with poor information content. In the most basic case, a stop-word list is used to reduce the number of terms that identify the collection; the terms in the stop-word list provide no information about the documents and occur in them with high frequency [10]. Another step in the preprocessing phase is lemmatization, which stands for the reduction of inflected word forms to their stem or root form. Lemmatization also reduces the number of terms that identify the document collection. This is important because it reduces storage space and computational time [4].
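As a rough illustration of this preprocessing step, the sketch below removes punctuation, numbers, and stop words and then lemmatizes; the stop-word list (shown transliterated) and the lemmatizer are placeholders, since the paper's actual Persian resources are not reproduced here.

```python
import re

# Hypothetical placeholders: the paper's actual Persian stop-word list and
# the lemmatizer of [8], [22] are not reproduced here.
STOP_WORDS = {"va", "az", "be", "dar", "ke"}      # transliterated examples
NON_TERM = re.compile(r"[!?.,:;()\[\]«»0-9]+")    # punctuation and digits

def lemmatize(token):
    # Stand-in for a real Persian lemmatizer.
    return token

def preprocess(text):
    """Strip punctuation and numbers, remove stop words, then lemmatize."""
    text = NON_TERM.sub(" ", text)
    return [lemmatize(t) for t in text.split() if t not in STOP_WORDS]
```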
In order to avoid a linear scan of the text corpus for each query, which leads to considerably poor response time and performance, we index the documents well in advance [11]. The major steps required are as follows (a sketch of the index-building steps appears after the list):
1) Collect the documents to be indexed.
2) Break the documents into word-based N-grams.
3) Produce a list of indexing terms.
4) Calculate the N-grams of each query and compare them with the N-grams of each document.
5) Weight and rank the documents based on their similarity to the query.
Search engines return a fixed number of matches ranked by their similarity to the input query. Ranking is the process of matching a query to the documents and allocating scores to the documents according to their degree of similarity [12]. The search should also be cost effective and time effective; it is very important that the results obtained for a search query correspond to what we are actually searching for [13]. We obtained the best results for Persian text retrieval with a word-based N-gram model using a new weighting method. Since the large store of N-grams from documents and queries occupies the entire memory space, and in order to deal efficiently with high volumes of user queries, we used alphabetical distribution to share the information across multiple processors and respond to several user queries in a short time.
The remainder of this paper is organized as follows. Section 2 describes the most related works, Section 3 studies some attributes of the Persian language and looks over its challenges for IR, Section 4 is dedicated to the concept of N-grams, and Section 5 discusses IR in a distributed setting. The proposed N-gram based IR method is described in Section 6. Evaluation results come in Section 7, and Section 8 concludes the paper.
II. RELATED WORKS
Indexing a text document refers to automatically extracting words that are suitable for building an index for the document. Mansour et al. [14] use a number of grammatical rules to extract stem words that become candidate index words and compute weights for these words relative to the containing document. The weight is based on how spread out the word is in a document, not only on its rate of occurrence. The candidate index words are then sorted in descending order of weight so that information retrievers can select the more important index words.
Another work on indexing is [15], which proposes a new indexing method that uses character-based N-grams together with word-based terms in order to overcome the high false-drop rate due to mismatches between queries and documents, as well as the new-word and phrase problems that exist in character-based and word-based systems. Another work, available in [16], handles the high dimensionality of text documents by mapping each document (instance) into R (the set of real numbers), representing the trigram frequency statistics profile of the document. Li et al. [17] propose a new keyword extraction method based on tf/idf with multiple strategies; their approach selects candidate keywords from unigrams, bigrams, and trigrams, and they show that it significantly outperforms the baseline method. Miller et al. [18] discuss implementation techniques that allow N-gram based retrieval methods to run over a gigabyte corpus on commodity personal computer hardware; they show that appropriately tuned gamma compression, extensible hash tables, and significant amounts of pre-calculation on the inverted index allow indexing a one-gigabyte multilingual corpus with 256 MB of memory.
Fuzzy logic [19], VSM [20], and language modeling methods [21] have been used in studies on the retrieval of Persian documents. A Persian test collection is introduced in [7], where the authors used a number of documents in the law domain. The collection is small (only 15 judged queries), and the relevant documents were discovered by just one IR system using the pooling method. The main drawback of this collection is that it is domain specific, which greatly limits its application. Nayeri and Oroumchian [19] proposed and tested a fuzzy retrieval system for Persian (FuFaIR) with support for fuzzy quantifiers in its query language; their comparison shows that FuFaIR performs considerably better than VSM. Experiments in [21] also suggest the usefulness of language modeling techniques for Persian. Oroumchian et al. [20] used the N-gram method to search Persian documents and obtained better results by applying LCA (Local Context Analysis) together with N-grams. Considering trigrams and 4-grams extracted from documents, without any stemming process, they showed that the 4-gram method increases retrieval efficiency compared with VSM.
Our work differs from the other experiments in four major ways: (1) we use a larger test collection with more queries, which demands more computation and thus a faster IR method; (2) we test a novel method not employed before; (3) we utilize the word-based N-gram method to get meaningful phrases that are most similar to the queries; and (4) we take advantage of distribution to bring our indexing method's query time below one millisecond.
III. PERSIAN LANGUAGE
Persian is one of the widely used languages in the Middle East, so there is a significant amount of Persian documents available on the web [16], [20]. There are 32 letters in the Persian alphabet. Words are written from right to left and numbers from left to right, both horizontally. There is no distinction between capital and small letters. Most letters join the following letter; depending on its position in a word, a letter can come after a joining or a non-joining letter, and this causes complexities in the language [16].
A. Spelling Normalization
Normalization includes cleaning the text of unnecessary marks such as "!", ".", "?", and so on, so that the text is prepared for the subsequent lemmatization and stop-word processing. Stop words are words having no role in the meaning; lemmatization consists of writing the root of a word and deleting its prefixes and suffixes.
The entrance of Arabic letters into Farsi, such as Tanwin and Hamza, the existence of two written forms of some letters, multiple spellings of some words such as the word for "Tehran", varying spaces within a word or between words (or no space at all), and newly coined Farsi words that are not in the dictionary are some of the problems that change the results of internet searches and should be resolved in normalization.
In our experiments we detected such variations using a normalization method that converts every occurrence of them to a single form. The N-grams method was used to identify such variations and turn each word into a fixed, standard form; using this method, we could also resolve problems in the words' structure.
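The sketch below illustrates this kind of single-form conversion. The character pairs are standard Arabic-versus-Persian Unicode variants (for example, Arabic Yeh U+064A versus Farsi Yeh U+06CC); the full mapping table used in the paper is not reproduced here, so treat the map as an assumption.

```python
# Map Arabic-script variants to a single Persian form; the pairs below are
# well-known Unicode variants, while the paper's full table is assumed.
NORMALIZATION_MAP = {
    "\u064A": "\u06CC",  # Arabic Yeh  -> Farsi Yeh
    "\u0643": "\u06A9",  # Arabic Kaf  -> Farsi Keheh
    "\u0629": "\u0647",  # Teh Marbuta -> Heh
    "\u0640": "",        # Tatweel (kashida) is removed outright
}

def normalize(text):
    """Convert every occurrence of a variant character to the standard form."""
    for src, dst in NORMALIZATION_MAP.items():
        text = text.replace(src, dst)
    return text
```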
B. Persian Lemmatization
One of the main challenges in IR systems is variation in word forms. In most cases, morphological variants of words have similar semantic interpretations and can be considered equivalent for the purposes of IR applications [22]. Persian has a complex morphology; to address this challenge, a stemming process is used to reduce inflected (or sometimes derived) words to their stem, base, or root form. Most Persian words (except some proper nouns and words borrowed from other languages) are derived from a root, which usually consists of three letters. We can view a word as derived by first applying a pattern to a root to generate a stem, and then attaching prefixes and suffixes to the stem to generate the word [23]. Thus, the key terms of a query or document are represented by stems rather than by the original words. For this reason, we use the lemmatizer proposed in [8], [22].
IV. N-GRAM INDEXING
The operations of text normalization and indexing are essential components of any IR system; we may regard a document as having been processed into a stream of indexing terms [11]. In information retrieval, each word needed by the user is compared with the saved documents. When there are few saved documents, the text is searched serially and each word in the documents is compared with the user query; but when there are many documents, text search is divided into two phases: 1) indexing and 2) searching. In indexing, an index of the searched words over all evaluated document texts is prepared; in searching, only the index, instead of the whole text, is searched [24].
Efficient algorithms and data structures may be used to index a document collection, enabling rapid retrieval of the documents containing query words [11]. N-gram indexing is also usable for retrieval over texts containing spelling errors or variations, since it successfully performs alphabetic string comparison among words. This kind of indexing needs no prior information about the concepts or language of the target text. N-grams consist of bursts of N characters or N words.
In the character-based N-gram approach, strings of N sequential characters are extracted from a text without considering word lengths or boundaries, while word-based N-grams consist of N consecutive words [25]. N-grams are generalized words consisting of N consecutive symbols. The N-gram model evaluates the statistical characteristics of texts in a given collection without considering the dictionary or the verbal and syntactic features of a natural language [16].
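The difference between the two extraction styles can be seen in a short sketch (an English sample is used here only for readability):

```python
def char_ngrams(text, n):
    """Character n-grams ignore word lengths and boundaries."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """Word-based n-grams are sequences of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("cool water", 3))      # ['coo', 'ool', 'ol ', 'l w', ...]
print(word_ngrams("cool the water", 2))  # ['cool the', 'the water']
```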
The N-gram length (N) and the method of extracting N-grams from documents vary from one author and application to another [16]. N-gram indexing reacts well to errors and spelling changes in words [26] and, by shifting the unit of indexing and querying from words to concepts [12], can produce reasonable results for Persian texts. We can therefore summarize the advantages of the N-gram representation as follows: (1) it is very resistant to grammatical and typographical errors in the document; (2) it does not need language preprocessing such as word stemming or stop-word removal.
There are disadvantages, too. One is that an index built through the N-grams method can yield incorrect results, because some of the produced N-grams may be semantically meaningless. An N-gram index can also become so massive that it cannot be controlled or managed; in such cases, finding the user's query among the index entries is time consuming and slows down retrieval, and as users' requests increase, this time grows even longer [27].
Character-based N-gram indexes contain overlapping substrings of complete words, which cannot correctly reflect the conceptual information of the text; since this problem occurs in character-based indexing, we use word-based indexing in our proposal [15]. A few efforts have been made to decrease the size of the N-gram index [28], [29].
Index compaction is one way to reduce the index size and store more index entries in less space; the total cost of transferring compacted data from memory to the CPU cache, plus decompression, can be lower than transferring the uncompacted data. Some methods also use selective N-gram extraction [4], [30]. This approach needs prior knowledge about N-gram frequencies, so as to extract only the N-grams with higher frequency; it reduces precision, because N-grams may be extracted that have no relation to the user's request. The method of [31] was proposed to keep the advantages of N-gram indexing without the long run time of building and loading the indexes. The main factor behind the growth in disk space and running time is that various N-grams are generated from each word; in the N-gram stemming method, a single N-gram replaces each word and the other N-grams derived from it.
Most information retrieval systems use the word-based N-gram method because of its advantages over the character-based one. Some of these advantages are:
1) The number of unique terms is smaller than the number of character-based N-grams generated from the same text (n > 3).
2) Stemming methods can only be used in word-based systems.
3) Character-based N-grams cannot remove stop words, but word-based systems can, using standard statistical techniques such as TF-IDF.
Consequently, we use the word-based indexing method to exploit these advantages [18].
V. DISTRIBUTED INFORMATION RETRIEVAL
Recent years have seen explosive growth in the availability of various kinds of data. Vast amounts of unstructured documents exist in many fields, and when we consider the web, documents can be retrieved from all over the world. A large set of documents has to be sifted to retrieve the information profitable to the user [24]. Ranking, on the other hand, is the process of matching an informal query to the documents and allocating scores to documents according to their degree of similarity to the query [3]. To organize and store a massive amount of textual information, an effective search method that improves search quality and reduces time cost is necessary [13].
In a distributed information retrieval system, a mediator is needed to send the user's query to the processors and to collect and organize their responses; arranging and providing the best responses is the task of the response-organization stage [32].
VI. DISTRIBUTED N-GRAM INDEXING
A. Text Preprocessing
Farsi documents, because of their complexity and the difficulty of processing them, need preprocessing before indexing. The existence of various writing conventions, spaces between or within words, different spellings of characters, morphological changes, and part-of-speech ambiguity are problems of the Farsi language that should be resolved in the preprocessing stage. This is necessary due to the variation in the ways text can be represented in Persian. The preprocessing section receives the document(s) and the possible query from the user and performs the following tasks:
Convert text files to UTF-8 encoding.
Word stemming. Sometimes a user types a keyword that does not occur in a document exactly, although its meaning does. Suppose a certain noun is singular in some places and plural in others: if the indexing algorithm does not perform word stemming, it treats each form of this noun as a different, independent term, which we know is incorrect. In information retrieval we can consider words with morphemic differences to be equal, because most of the time they carry similar concepts; the keywords of a query or document are therefore represented by their roots. Stemming algorithms are developed to reduce a word to its root form, so the number of distinct terms decreases and frequency counting is done only over these roots [33]. Little research has been done on Persian stemming [23], [34], [35]. Affixes are non-independent letters that combine with other words to make new words and new meanings. They fall into three categories: (1) letters added to the beginning of words, which we call prefixes; (2) letters located between two words, which we call infixes; and (3) letters added to the end of words, which we call suffixes. Affixes and their effect on words and meanings are important in Indo-European languages, and this holds for Persian, which belongs to the Indo-European family. In Persian, each word contains a stem that carries the main idea; the total vocabulary can thus be reduced to a smaller set of stems, with the remaining words derived from these stems. Deriving a word from a principal stem gives the main concept a more specific shape or expresses a syntactic role in a sentence [22]. Using these rules, we stem both nouns and verbs completely to their roots, checking nouns and verbs against their specific affixes and extracting stem words from nouns and verbs separately.
Detect and remove characters such as punctuation marks, diacritics, non-letters, and stop words. Stop words are words that play no role in determining a specific range of information [10]; removing them from texts and queries simplifies the searchable text, decreases the volume of the documents, changes document weights, and increases the speed of the search. We compiled a list of Persian stop words. (A sketch of the affix-stripping step appears after this list.)
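A minimal affix-stripping sketch in the spirit of this step follows; the affix lists are short transliterated placeholders, not the actual inventory handled by the Persian stemmers of [23], [34], [35].

```python
# Hypothetical, transliterated affix lists for illustration only; a real
# Persian stemmer checks nouns and verbs against their specific affixes.
NOUN_SUFFIXES = ["ha", "an", "tar", "tarin"]   # e.g., plural and comparative
VERB_PREFIXES = ["mi", "be", "na"]             # e.g., durative and negative

def strip_affixes(word):
    """Greedily remove one known prefix and one known suffix from a token."""
    for prefix in VERB_PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            word = word[len(prefix):]
            break
    for suffix in sorted(NOUN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[:-len(suffix)]
            break
    return word
```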
B. N-Gram Extraction
Recognizing the limits of the text structures, i.e., sentences, clauses, and words, is the first step of text processing, called tokenization. Tokenization algorithms identify the dots and spaces between words to determine these boundaries.
We use the N-gram model and move word by word from the beginning of the document toward the end of each sentence in order to produce N-grams. If m is the number of words considered, the lengths of the N-grams in the sentence will
be between 1 and m-1. Continuing the process for the other sentences, if an extracted N-gram is repeated, its frequency is incremented; otherwise it is added to the dictionary with a frequency of 1.
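A sketch of this extraction loop, assuming sentences are already tokenized: a repeated N-gram has its frequency incremented, while a new one enters the dictionary with frequency 1, exactly as described above.

```python
from collections import Counter

def extract_ngrams(sentences, max_n=3):
    """sentences: iterable of token lists. Returns n-gram -> frequency."""
    freq = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                # repeated n-gram: frequency increases; new: starts at 1
                freq[gram] += 1
    return freq
```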
C. Weight Assignment and Index Selection
In this step, we assign weights to the stemmed terms with respect to their documents. Several methods are used for weighting terms, of which the most widely used is TF-IDF [17]. Its first factor is the frequency of occurrence of a word in its containing document, and its second factor is the word's frequency across all documents. This weighting is a statistical criterion used to evaluate the importance of a particular word in a collection of documents. Since we care about the importance of a word in the whole collection of documents, and the TF-IDF weighting is oriented toward a single document, we modified TF-IDF as in (1). In our experiments, the weight of an N-gram depends on three factors, which together represent the significance of the N-gram in its containing document and in the corpus:
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}, \qquad \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{idf}_{i} = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|} \qquad (1)$$

where
$n_{i,j}$: the number of occurrences of the considered term $t_i$ in document $d_j$
$|D|$: the total number of documents in the corpus
$|\{d_j : t_i \in d_j\}|$: the number of documents in which the term $t_i$ appears

The first factor, $\mathrm{tf}_{i,j}$, and the second factor, $\mathrm{idf}_{i}$, are the same as in TF-IDF, but the third factor, $B_{i,j}$, depends on the inverse aggregation of the N-gram in the document. From our observations, the rate of aggregation of 1-grams in any document is twice that of 2-grams and three times that of 3-grams. We therefore develop a formula for the spread factor such that it increases as the word spreads over the documents and decreases as the term concentrates in a specific document:

$$w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i} \times B_{i,j} \qquad (2)$$
Based on this result, we consider the probability of 1-grams as α, of 2-grams as β = 2α, and of 3-grams as γ = 3α. In this formula, $w_{i,j}$ is the weight of the term $i$ in the document $j$, and $B_{i,j}$ is the desired N-gram coefficient, taking the value α, β, or γ depending on n, the length of term $i$ in words:

$$\alpha + \beta + \gamma = 1, \qquad \beta = 2\alpha, \qquad \gamma = 3\alpha \qquad (3)$$
Since the total probability of any event is taken to be 1, the total probability of the N-grams in a query is also set to 1; as this paper studies up to 3-grams, the total probability is computed over N-grams with n = 1, 2, 3. The values of α, β, and γ give the probabilities of 1-grams, 2-grams, and 3-grams, respectively, in the query and the documents, and each coefficient is multiplied by the number of matching N-grams of that length in the query. If the query length equals 3, the coefficients $B_{i,j}$ for unigram, 2-gram, and 3-gram terms come to 16.7, 33.3, and 50 percent, respectively. If the query length is 2, then γ = 0, and if the length is 1, then β = γ = 0. To calculate the weights of queries with more than 3 words, we divide them into groups of unigrams to 3-grams, calculate $B_{i,j}$ for each group using the formula in (2), and add the results together. For example, the sub-queries obtained for a sample four-word query are shown in Table I.
For each line of Table I, we add the weights obtained in each column, take the maximum weight obtained in the line, and use it as the final weight of the corresponding document. If all the N-grams obtained in one line of the table are the same as those of another line, we ignore it. Finally, we choose the documents with the highest term weights for the user query and arrange them in descending order.
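The following sketch shows one reading of how the coefficients of (3) combine with (2); the tf and idf values are assumed to be precomputed, and the shorter-query cases follow the rules above (for a three-word query the coefficients come out as 16.7, 33.3, and 50 percent).

```python
def b_coefficients(max_n):
    """(alpha, beta, gamma) with beta = 2*alpha and gamma = 3*alpha,
    normalized so the coefficients in use sum to 1, as in (3)."""
    weights = [1, 2, 3][:max_n]        # relative weights for 1..max_n grams
    total = sum(weights)
    coeff = [w / total for w in weights] + [0.0] * (3 - max_n)
    return tuple(coeff)

def ngram_weight(tf, idf, n, max_n=3):
    """w_{i,j} = tf_{i,j} * idf_i * B_{i,j}, as in (2)."""
    return tf * idf * b_coefficients(max_n)[n - 1]

print(b_coefficients(3))  # (0.1666..., 0.3333..., 0.5)
print(b_coefficients(2))  # (0.3333..., 0.6666..., 0.0) -- gamma = 0
```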
D. Distribution
Data mining often needs many resources for storage and a long computation time. Distributed data mining analyzes data in a distributed setting, combining a centralized view of the data with distributed analysis of it; using distributed data mining, a data set can be processed in parallel. Serial search over the whole data set is inefficient in terms of CPU and I/O time, and as the data become massive, memory problems appear. To respond better to user queries, distribution and a reduction of search cost are necessary. A distributed system improves sharing, efficiency, time cost, extensibility, and the quality of responses to users; through it, the problem of mining massive data can be solved by linking separate computers into a network [36].
Therefore, the data is partitioned and transmitted to the computing nodes in advance; it is important to efficiently partition and distribute the data among the nodes for parallel computation. In this paper we focus on the problem of N-gram distribution for indexing. N-grams can be partitioned onto multiple processors by many algorithms; we use alphabetical partitioning of N-grams and transfer them across the processors. Each processor can create transparent remote processes, and the number of processors depends on the volume of the partitions.
The alphabetical distribution method was used in our study; its implementation simplicity is its main feature, though it is language dependent. The distribution divides the alphabet among the processors in letter order: if m and n are, respectively, the number of processors and the number of alphabetical letters, then n/m letters are assigned to each processor. As Farsi has 32 letters, we used four processors as local servers to process the N-grams; as shown in Fig. 1, 8 letters are assigned to each processor, and every N-gram whose first letter falls within a processor's 8-letter range is located in that processor.
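A sketch of this routing rule follows; the Persian letter ordering used is the conventional one, and unknown leading characters fall back to the first processor as an assumption.

```python
# Persian has 32 letters; with m processors each holds n/m consecutive
# letters (8 letters per processor for m = 4).
PERSIAN_LETTERS = list("ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی")

def processor_for(ngram, num_processors=4):
    """Route an n-gram by the alphabetical position of its first letter."""
    per_processor = len(PERSIAN_LETTERS) // num_processors   # 8 for 4 CPUs
    first = ngram[0]
    pos = PERSIAN_LETTERS.index(first) if first in PERSIAN_LETTERS else 0
    return min(pos // per_processor, num_processors - 1)
```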
To assess the impact of this distribution, one master and four slaves were involved in the experiment; each slave copied 0.5 MB of data from the master. The time needed for searching the queries came to less than a few seconds, as did the time for saving the search results in a buffer. Table III shows the times for executing 20 queries with and without distribution, and Fig. 2 shows the procedure in detail.
Since we extract all the available 1-grams to 3-grams, and many N-grams are produced, we would run out of memory if we distributed by N-gram length (n): the number of 2-grams is far larger than the number of 1-grams, and the number of 3-grams far larger than the number of 2-grams. If we assigned the 1-, 2-, and 3-grams to separate processors, we would have trouble loading them into RAM (especially for the 2- and 3-grams), and the search speed would drop sharply.
TABLE I: N-GRAMS EXTRACTED FROM A SAMPLE SENTENCE.

Columns: 3-gram coefficient (γ) and 3-gram; 2-gram coefficient (β) and 2-gram; 1-gram coefficient (α) and 1-gram. Each row pairs one sub-query N-gram of the sample sentence with its coefficient.
Fig. 1. Division of letters in the alphabetical distribution.
For this reason, the distribution method balances the available N-grams across the processors, so there is no problem loading the information and N-grams into memory. In addition, the user query time is greatly reduced. If the query time in processor $i$ is $t_i$, and the time for organizing, comparing the N-grams obtained from each query, and ranking the documents is $t_o$, the total time is

$$t = \max_i(t_i) + t_o$$

Compared with other distribution methods, and with no distribution at all, this method gives better results.
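A sketch of the resulting timing model: the local searches run in parallel, so the search phase costs roughly the maximum t_i over the processors, after which the master merges and ranks (the t_o term). The search_partition function is a stand-in for a local server holding one alphabetical share of the index.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def search_partition(partition, query_ngrams):
    """Stand-in for a local server: partition maps n-gram -> {doc: weight}."""
    scores = {}
    for gram in query_ngrams:
        for doc, w in partition.get(gram, {}).items():
            scores[doc] = scores.get(doc, 0.0) + w
    return scores

def distributed_search(partitions, query_ngrams):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        parts = pool.map(search_partition, partitions,
                         [query_ngrams] * len(partitions))
    merged = {}                      # master-side organizing/ranking: t_o
    for scores in parts:
        for doc, w in scores.items():
            merged[doc] = merged.get(doc, 0.0) + w
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, time.perf_counter() - start   # roughly max_i(t_i) + t_o
```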
Fig. 2. Steps of distributed indexing.
VII. EVALUATION
To evaluate our proposed method, we used a standard collection of Persian texts called Hamshahri. The Hamshahri corpus is the largest collection of Persian texts, prepared and distributed by the University of Tehran; it has 50 queries with relevance judgments over news articles. The produced N-grams are loaded into memory only once, so subsequent searches do not need to reload them. In this section we compare our proposed method with the local context analysis model [20] and a term-based method using the lnu.ltu weighting model [20] for retrieving Persian texts. Among the many solutions suggested for Persian retrieval, these two models have better precision and recall values than the others. There are many weighting models with which a term-based method could be compared against ours [5], [7], [19], [20]; the reason we used the lnu.ltu method is that Oroumchian compared lnu.ltu with other weighting methods, found it more efficient, and obtained better results with it in retrieving Persian documents [6], [7], [19], [21]. The term-based method with the lnu.ltu weighting model studies each word (unigram) in the documents and queries, applying the lnu.ltu weighting to every word. Local context analysis is based on concepts, defined as single nouns, or two or three adjacent nouns, instead of keywords or terms; these concepts are computed over the locally retrieved passages, and the top m ranking concepts are chosen for query expansion. Local context analysis only marginally improved the results over the lnu.ltu method. This may be because the lnu.ltu weighting performs very well on the Farsi language and is difficult to improve upon, so we ran more experiments with different parameters to see whether a better configuration for LCA could be found. For the comparison we used the criteria of precision and recall, whose formulas are shown in (4) and (5). Since the Hamshahri relevance judgments are the only ones available for our evaluation, they are our single means of matching our results against other work. Because of the large number of queries used in the comparison, we took advantage of mechanical evaluation: each line of the judgment file, which includes the precision value of a query, is compared with our result. We show the evaluation results in the diagrams of Fig. 3 and in Table II.
$$\mathrm{precision} = \frac{|\,\mathrm{retrieved} \cap \mathrm{relevant}\,|}{|\,\mathrm{retrieved}\,|} \qquad (4)$$

$$\mathrm{recall} = \frac{|\,\mathrm{retrieved} \cap \mathrm{relevant}\,|}{|\,\mathrm{relevant}\,|} \qquad (5)$$
Here retrieved refers to the documents retrieved by the proposed system, and relevant is the set of documents marked as relevant for each query in the corpus. To evaluate, we obtain the precision of each query at fixed recall levels, i.e., 11 points in increments of 0.1, and then average the precision values of the different queries at each fixed recall level, as shown in Table II. We compared the evaluation results in Table II with the other methods. For example, for query 1, the precision value at recall 0.1 is 0.7; we obtained the precision of the other queries at recall 0.1 and then calculated the mean precision for that recall value. The full list of queries is available in the Hamshahri corpus.
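A sketch of this evaluation, assuming the judgments give a set of relevant document ids per query; interpolation conventions differ between tools, so this is one common reading of the 11-point computation.

```python
def precision_recall(retrieved, relevant):
    """Equations (4) and (5) for a retrieved list and a relevant set."""
    hits = len(set(retrieved) & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    return p, r

def eleven_point_precision(ranked, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    best = [0.0] * 11
    hits = 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        recall, precision = hits / len(relevant), hits / k
        for i in range(11):
            if recall >= i / 10:
                best[i] = max(best[i], precision)
    return best
```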
Fig. 3. Recall/precision of the proposed approach in IR.
TABLE II: THE PRECISION/RECALL VALUES.

Recall | Proposed Method | Term Based | Local Context Analysis
0.0    | 0.89            | 0.76       | 0.84
0.1    | 0.87            | 0.73       | 0.78
0.2    | 0.83            | 0.68       | 0.76
0.3    | 0.81            | 0.66       | 0.74
0.4    | 0.80            | 0.65       | 0.74
0.5    | 0.79            | 0.64       | 0.72
0.6    | 0.77            | 0.63       | 0.71
0.7    | 0.77            | 0.62       | 0.71
0.8    | 0.75            | 0.60       | 0.70
0.9    | 0.74            | 0.59       | 0.70
1.0    | 0.74            | 0.58       | 0.70
As seen in Fig. 3, our presented method outperforms the other methods of retrieving Persian documents. The label "local context analysis" in Fig. 3 denotes a vector space model based on the local context analysis weighting method, explained in detail in [20]. The authors of [20] showed that N-grams with N = 4 are a viable approach that outperforms their non-stemmed term-based counterparts, and that the LCA system and the vector space model with the lnu.ltu weighting scheme, slope = 0.25, and pivoted unique normalization perform better than FuFaIR [19].
The results of distributed processing are shown in Table III; we were able to increase the query execution speed to a large extent. We used a client-server, web-based approach to implement our distribution system. Each row of the table shows the total processing time of one query, identified by its ID. For our evaluation, we used 20 Hamshahri queries, prepared by the University of Tehran, which are mostly 4-grams with a few 3-grams; these N-grams are processed simultaneously and execute in a few milliseconds. Four computers are used as clients and one computer as a server. The machine configuration used in the evaluation is: RAM 2.47 GB; CPU: Intel dual-core, 2.40 GHz; OS: Windows 7 Ultimate, 32-bit. In our test, the local servers get N-grams from the master server, calculate the N-gram weights, and send back the documents that contain those N-grams. The master server sends the alphabetical N-grams to the local servers over the network, receives the document IDs from them, compares the weight of each document, and ranks the documents by weight.
TABLE III: TOTAL TIME OF QUERY PROCESSING.

Query ID | With Distribution | Without Distribution
1        | 0.12              | 2.923
2        | 0.053             | 1.888
3        | 0.193             | 3.681
4        | 0.16              | 3.064
5        | 0.11              | 2.179
6        | 0.056             | 1.093
7        | 0.067             | 1.584
8        | 0.074             | 2.496
9        | 0.036             | 1.763
10       | 0.04              | 1.49
11       | 0.086             | 2.874
12       | 0.229             | 4.351
13       | 0.076             | 2.878
14       | 0.179             | 2.103
15       | 0.049             | 1.744
16       | 0.037             | 1.582
17       | 0.011             | 1.15
18       | 0.22              | 5.524
19       | 0.08              | 3.134
20       | 0.07              | 3.03
VIII. CONCLUSIONS AND FUTURE STUDIES
In this paper, we extracted the 1-grams to 3-grams available in documents and queries using an N-gram indexing method, and we reached high performance in retrieving Persian documents by presenting a formula for weighting these N-grams and finding the documents with the highest weights. We also used a distribution method to increase the speed of information retrieval, making it possible to answer a query in a few milliseconds. In future work, we plan to extract 4-grams from documents and compare the results with the method proposed in this paper using the lnu.ltu weighting method.
REFERENCES
[1] F. Can, Turkish Information Retrieval: Past Changes Future, Berlin, Germany: Springer, vol. 4243, 2006.
[2] N. Fuhr, Models in information retrieval, in Lectures on Information Retrieval, 2001, pp. 21-50.
[3] M. Agosti, et al., Eds., Lectures on Information Retrieval, Springer-Verlag New York, Inc., 2001, pp. 21-50.
[4] C. Manning, et al., Introduction to Information Retrieval, Cambridge
University Press, 2008.
[5] J. Mayfield and P. McNamee, Single n-gram stemming, Presented at the Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 2003.
[6] A. AleAhmad, et al., Experiments with English-Persian text retrieval, Presented at the Proceedings of the 2nd ACM Workshop on Improving Non-English Web Searching, Napa Valley, California, USA, 2008.
[7] M. Shamsfard, et al., STeP-1: A set of fundamental tools for Persian text processing, Presented at the Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, 2010.
[8] F. Oroumchian and F. Garamaleki, An evaluation of retrieval performance using Farsi text, Presented at the First Eurasia Conference on Advances in Information and Communication Technology, Tehran, Iran, 2002.
[9] O. Kashefi, et al., Towards Automatic Persian Spell Checking. Tehran:
SCICT, 2011.
[10] A. Zamanifar, et al., A new hybrid Farsi text summarization technique based on term co-occurrence and conceptual property of the text, Presented at the Proceedings of the 2008 Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, Thailand, 2008.
[11] I. A. El-Khair, Effects of stop words elimination for Arabic information retrieval: A comparative study, International Journal of Computing and Information Sciences, vol. 4, pp. 119-133, 2006.
[12] S. Renals, et al., Indexing and retrieval of broadcast news, Speech
Commun., vol. 32, pp. 5-20, 2000.
[13] A. Moffat and J. Zobel, Self-indexing inverted files for fast text
retrieval, ACM Trans. Inf. Syst., vol. 14, pp. 349-379, 1996.
[14] P. A. Vijaya, et al., Leaders-Subleaders: An efficient hierarchical
clustering algorithm for large data sets, Pattern Recognition Letters,
vol. 25, pp. 505-513, 2004.
[15] N. Mansour, et al., An auto-indexing method for Arabic text, Inf.
Process. Manage., vol. 44, pp. 1538-1545, 2008.
[16] L. Du and Y. Sun, A new indexing method based on word proximity for Chinese text retrieval, Journal of Computer Science and Technology, vol. 15, pp. 280-286, 2000.
[17] L. Khreisat, A machine learning approach for Arabic text
classification using N-gram frequency statistics, Journal of
Informetrics, vol. 3, pp. 72-77, 2009.
[18] J. Li, et al., Keyword extraction based on tf/idf for Chinese news
document, Wuhan University Journal of Natural Sciences, vol. 12, pp.
917-921, 2007.
[19] D. E. Miller, et al., Techniques for Gigabyte-Scale N-gram Based Information Retrieval on Personal Computers, 2007.
[20] A. Nayyeri and F. Oroumchian, FuFaIR: A fuzzy Farsi information retrieval system, Presented at the Proceedings of the 4th ACS/IEEE International Conference on Computer Systems and Applications, Dubai/Sharjah, UAE, 2006.
[21] F. Oroumchian, et al., N-gram and local context analysis for Persian text retrieval, Presented at the 9th International Symposium on Signal Processing and Its Applications, 2007.
[22] K. Taghva, et al., Language model-based retrieval for Farsi documents, Presented at the Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), vol. 2, 2004.
[23] O. Kashefi, et al., Optimizing document similarity detection in Persian information retrieval, Journal of Convergence Information Technology (JCIT), vol. 5, pp. 101-106, 2010.
[24] M. I. Mobarakeh and B. M. Bidgoli, Verb detection in Persian corpus, Journal of Digital Content Technology and its Applications (JDCTA), vol. 3, pp. 58-65, 2009.
[25] M. Basavaraju and D. R. Prabhakar, Clustered distributed index for
efficient text retrieval using threads, International Journal of Grid
Computing & Applications (IJGCA), vol. 1, 2010.
[26] W. Cavnar, Using an n-gram-based document representation with a
vector processing retrieval model, in TREC, 1994.
[27] A. AleAhmad, et al., Cross language experiments at Persian@CLEF 2008, in Evaluating Systems for Multilingual and Multimodal Information Access, vol. 5706.
[28] C. Peters, et al., Eds., Evaluating Systems for Multilingual and Multimodal Information Access, Springer Berlin/Heidelberg, 2009, pp. 105-112.
[29] J. P. Callan, et al., Searching distributed collections with inference
networks, Presented at the Proceedings of the 18th Annual
International ACM SIGIR Conference on Research and Development
in Information Retrieval, Seattle, Washington, United States, 1995.
[30] F. Scholer, et al., Compression of inverted indexes for fast query
evaluation, Presented at the Proceedings of the 25th Annual
International ACM SIGIR Conference on Research and Development
in Information Retrieval, Tampere, Finland, 2002.
[31] W. Hersh, Managing Gigabytes: Compressing and Indexing Documents and Images (second edition), Inf. Retr., vol. 4, pp. 79-80, 2001.
[32] E. Sutinen and J. Tarhio, Filtration with q-samples in approximate
string matching, Presented at the Proceedings of the 7th Annual
Symposium on Combinatorial Pattern Matching, 1996.
[33] P. McNamee and J. Mayfield, N-gram morphemes for retrieval, in Working Notes of the CLEF 2007 Workshop, Budapest, Hungary, 2007.
[34] D. Hiemstra, Distributed information retrieval using keyword
auctions, Ed. Enschede: Centre for Telematics and Information
Technology, University of Twente, 2008.
[35] D. Jimenez, et al., The influence of semantics in IR using LSI and
K-means clustering techniques, Presented at the Proceedings of the
1st International Symposium on Information and Communication
Technologies, Dublin, Ireland, 2003.
[36] A. Mokhtaripour and S. Jahanpour, Introduction to a new Farsi stemmer, Presented at the Proceedings of the 15th ACM International Conference on Information and Knowledge Management, Arlington, Virginia, USA, 2006.
[37] K. Taghva, et al., A stemming algorithm for the Farsi language, Presented at the Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05), vol. 1, 2005.
[38] W.-C. Shih, et al., Performance-based data distribution for data
mining applications on grid computing environments, The Journal of
Supercomputing, vol. 52, pp. 171-198, 2010.
Mohadese Danesh received her M.Sc. in Computer Software Engineering from Iran University of Science and Technology, Tehran, Iran, in 2011. She
is now a researcher in the Computer Department of
National Iranian Oil Company and a member of
Young Researchers Club, Tehran, Iran.
Behrouz Minaei obtained his Ph.D. from the Computer Science and Engineering Department of Michigan State University, USA, in the field of data mining. Now, he is an assistant professor in the
Department of Computer Engineering, Iran
University of Science & Technology. He is also
leader of Data and Text Mining Research Group in
Noor Computer Research Center, Qom, Iran,
developing large scale NLP and Text Mining
projects for Persian/Arabic language.
Omid Kashefi received his M.Sc. in Computer Software Engineering from Iran University of Science and Technology, Tehran, Iran, in 2008. He
is a Lecturer and Research Associate in the School
of Computer Engineering at the Iran University of
Science and Technology. His major research
interests include distributed systems,
virtualization, system software and operating
systems, and natural language processing.