Query-based information retrieval and knowledge
extraction using Hadith datasets
Ahsan Mahmood¹, Hikmat Ullah Khan², Zahoor-ur-Rehman¹, Wahab Khan³
¹Computer Science Dept., COMSATS Institute of Information Technology, Attock, Pakistan
²Computer Science Dept., COMSATS Institute of Information Technology, Wah, Pakistan
³Department of Computer Science and Software Engineering, International Islamic University, Islamabad 44000, Pakistan
Ahsan_mahmood_awan@yahoo.com, Hikmat.ullah@ciitwah.edu.pk, xahoor@ciit-attock.edu.pk, wahab.phdcs72@iiu.edu.pk
Abstract- In natural language processing, one of the fundamental tasks is Named Entity Recognition (NER), which involves identifying the names of people, locations, and other entities. Applications of NER include chatbots, speech recognition, machine translation, knowledge extraction, and intelligent search systems. NER has been an active research domain for the last 10 years. In this paper, we propose a knowledge extraction framework to extract named entities from the Urdu translation of Sahih Al-Bukhari, a world-renowned Hadith book. The proposed framework is based on a finite state transducer system to extract entities, and it processes the Hadith content using Part of Speech (POS) tagging. A Conditional Random Field, a probabilistic sequence labeling model, processes the extracted nouns for NER and classification. In the future, we aim to extend the proposed framework to rank the hadith content and apply the Vector Space Model.
Index Terms---Data Extraction, Hadith Information Retrieval,
Framework, Named Entity Recognition.
I- INTRODUCTION
In the current era of Information Technology, a major part of the data is in textual form, and this data is increasing at a very rapid rate because web users are shifting towards social media, where they share new textual data on a daily basis [1]. This data is available in multiple formats and at different places, including news, social media, blogs, Wikipedia, and articles, and it is also available in multiple languages. The huge increase in textual data on the World Wide Web mostly came after the birth of social media, where millions of people share textual content in different languages [2], [3]. In human-computer interaction, understanding natural language is one major task, and it is achieved with the help of NLP [4]. To completely understand and generate natural language, it is first necessary to perform basic NLP, i.e., syntactic analysis of the language, which includes understanding language morphology, part-of-speech tagging, parsing, stemming, etc., and then to perform the complex part of NLP in the form of Named Entity Recognition (NER), lexical semantics, sentiment analysis, machine translation, natural language understanding, natural language generation, etc. NER is one of the challenging tasks in NLP due to the unavailability or limitation of supervised training data [5]. Another thing that makes the NER task challenging is morphological richness [6]. NER involves the extraction of information such as the names of persons, locations, organizations, etc. NER is essential in NLP, especially when we want to extract the underlying knowledge from textual data, as most of the important entities are named entities (NEs). To achieve this task, NER relies on other NLP tasks such as stemming, POS tagging, and morphological analysis [7]. These tasks are applied over the textual data one by one, in the form of a framework, to extract the underlying information from the data that leads to knowledge extraction.
In the process of knowledge discovery from databases, Islamic knowledge sources such as the Quran and Hadith have high significance, as a huge amount of knowledge is present in these two sources. We need to understand the need for extracting Islamic knowledge, the role of Artificial Intelligence and Data Mining (DM), and how diverse methods can be applied to extract the relevant knowledge from Islamic data. What we want to achieve is access to the underlying knowledge in Islamic sources, which will enable the user to perform intelligent queries, searches, and relational queries over the knowledge system. As data all around the world is increasing at a very rapid rate, we must extract the knowledge of Islamic resources to distinguish between true and fabricated knowledge.
In this paper, we approach NLP through the Urdu translation of Sahih Bukhari, and our ultimate objective is to build an Information Retrieval (IR) system in which users can query and get the best results. Urdu is a morphologically rich language; it is similar to Arabic and has its roots in Arabic and Persian. Due to its lack of capitalization and its morphological complexities, NLP in Urdu is challenging [8]. We start with a model based on a Finite State Transducer (FST) to extract the entities of each hadith. From the extracted entities, we process the textual part (Sanad and Matn) of the hadith further, tokenizing the data and removing stop words. The tokenized words are then used for part of speech tagging with the help of Conditional Random Fields (CRF). In the tagged data, nouns are processed further, as most of the important information in the data is carried by nouns, so nouns are used to extract NEs from the tagged dataset. After the extraction of NEs, we detect multi-words and divide the extracted NEs into different classified groups.
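The tokenization and stop-word removal step described above can be sketched as follows; the whitespace tokenizer, the romanized sample sentence, and the small stop-word list are illustrative assumptions, not the actual Urdu resources used:

```python
# A minimal sketch of tokenization and stop-word removal.
# The stop-word list and the sample text are hypothetical
# placeholders, not the actual resources used in this work.

def tokenize(text, stop_words):
    """Split text on whitespace and drop stop words."""
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]

stop_words = {"aur", "ke", "ne", "se"}  # hypothetical romanized Urdu stop words
sample = "Abdullah ne kaha aur phir woh gaya"
print(tokenize(sample, stop_words))  # ['Abdullah', 'kaha', 'phir', 'woh', 'gaya']
```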
The paper is organized as follows: Section II presents the background work related to this paper, Section III contains the related work, and Section IV describes our proposed methodology, divided into different parts.
II- BACKGROUND
Let us share our existing work, in which we developed a Hadith Dataset Repository prepared by extracting data from different hadith books in different languages, including English, Urdu, and Arabic. The Dataset Repository is freely available for research purposes. In our earlier work, we selected different authenticated sources of hadith datasets from different hadith books, including Sahih Muslim, Sahih Al-Bukhari, Sunan Abu Dawood, Muwatta Imam Malik, etc. Data was extracted by crawling different websites and collecting the required data. In each book, the Hadiths are arranged in different formats, including chapters, sections, books, etc. Because the data was divided into hierarchies, we had to use a different algorithm for each book in order to extract all of its components. The algorithms were used along with regular expressions to parse each part of a hadith separately into the database.
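The regular-expression parsing described above can be sketched as follows; the field labels and record layout are hypothetical, since each source book used its own format:

```python
import re

# A minimal sketch of regex-based hadith parsing. The field labels
# ("Book:", "Chapter:", "Hadith:") and the flat record layout are
# hypothetical; each source book would need its own pattern.
record = "Book: 2 Chapter: 5 Hadith: 312 Text: ..."
pattern = re.compile(r"Book:\s*(\d+)\s*Chapter:\s*(\d+)\s*Hadith:\s*(\d+)")
m = pattern.search(record)
if m:
    book_no, chapter_no, hadith_no = m.groups()
    print(book_no, chapter_no, hadith_no)  # 2 5 312
```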
III- RELATED WORK
Multiple systems for text mining have been created using different techniques. NLP is one of the major fields of application of automata theory [9]. Through NLP we can retrieve information from textual data. In an information retrieval system, one of the important tasks is NER. NER is a task that consists of identifying lexical units from words; it involves the detection of concrete entities and the type or class to which each entity refers. In most NLP systems, this has been an important piece of information [10]. In the field of Named Entity Extraction (NEE), the Message Understanding Conference series has been one of the major forums where researchers compare their work on NER. The problem with NEE systems is that most of the work done in this field is specific to English or other LTR languages, while comparatively little has been done for Arabic and for South Asian languages such as Hindi and Urdu. Some work in the Arabic language has been presented, such as the ANERsys system, which is built for NER on Arabic textual data [11]. Meselhi et al. proposed a hybrid approach for NER in the Arabic language that includes a rule-based component [12]. Shaalan presented a survey paper discussing the different kinds of work done in Arabic NER and the techniques used for this purpose [13].
Over the last decade, researchers have been working on NER for South Asian languages using different techniques and rule-based systems. A system for NER in the Punjabi language has been developed using various types of rules, achieving accuracy levels from 82% to 89% [14]. Another system for NER was built based on the Hidden Markov Model; it is language independent and was developed for different South Asian languages [15]. Yet another system was built for NER in the Nepali language using a semi-hybrid, rule-based approach [16].
In recent years, South Asian languages such as Hindi and Urdu, along with Arabic, have been under focus by researchers. Urdu is a primary language of South Asia. The Urdu script is written right to left (RTL), like Arabic and Persian [17]. The International Joint Conference on Natural Language Processing 2008 hosted an important workshop in which NER systems were developed for Indian languages and Urdu; these systems used statistical or hybrid techniques to achieve the NER task [18]. In Urdu NLP, different types of techniques have been used to achieve NEE. One model was built on the basis of Conditional Random Fields (CRF) to achieve NER on Urdu language text [19]. Later, Mukund and Srihari proposed another, four-stage model based on CRF for the Urdu language that achieved an F-score of 68.9% [20]. Text processing applications require recognizing all entities such as names, numbers, organizations, dates, etc., and recognizing these entities means a vital goal has been achieved in this field [21]. According to Daud et al., NER is harder in the Urdu language due to its lack of capitalization features [8].
Hadith data has been used by many researchers for NLP and NER. Harrag et al. [22] performed named entity extraction from Hadith data; their work includes extracting the lexical units in words and other kinds of concrete entities. Another system has been presented for the extraction of narrators from a Hadith dataset through NER methods [23]. Further work has been presented on the extraction of surface information from hadith: the authors detected the entities and words that contain knowledge in Hadith texts, and their system was built on finite state transducers [24]. Ibrahim et al. [25] presented work on hadith data in which they performed a hierarchical analysis of the Isnad from hadith text and used the Isnad data for the analysis of Hadith. Saloot et al. [26] presented a paper comparing different kinds of NLP tasks applied over hadith data in different languages. Our previous work also includes work on Islamic sources such as the Holy Quran: we previously proposed a system through which a user can perform semantic search, based on an ontology [27], and another system for the same purpose based on a WordNet model [28].
IV- PROPOSED METHODOLOGY
The aim of this research is to propose an Information Retrieval system so users can query and get results according to their query. The most important steps of this research include extraction of entities, creation of tokens, removal of stop words, part of speech tagging of tokens, use of the tagged data for NER, multi-word detection from NER, and the creation of an IR system from the NER data. The proposed hierarchy of entities to be extracted is shown in Figure 1. We explain each part of the process in detail in the next sections.
Figure 1 Hierarchy of Entities for detection as per Proposed Model
A- Corpus of Sahih Al-bukhari
We obtained the complete Urdu translation of Sahih Bukhari from islamicurdubooks.com¹. The book is available in MS Word format and divided into 7 separate files. Each file contains about 1000 Hadiths, and the complete book contains 7563 Hadiths in total. The book has 3 hierarchical levels: books, chapters within books, and hadiths within chapters. Each hadith has multiple parts, which will be discussed in the next section.
B- Entity Extraction from Hadith
In order to process the book, it was necessary to extract each and every Hadith, with reference to its Book Name, Book Number, Chapter Name, Chapter Number, and Hadith Number, and then to extract the information from the Hadith Sanad and Matn. For this purpose, we used Finite State Transducers (FSTs).

¹ www.islamicurdubooks.com/download

The FST is one of the most commonly used tools in natural language processing. An FST is essentially a sequential machine with multiple possible outputs, defined by six components:
• Σ denotes the input alphabet
• Γ denotes the output alphabet
• Q denotes the set of states
• I denotes the set of initial states
• F denotes the set of final states
• Δ denotes the transition relation
An example of the proposed FST model is shown in Figure 2.
Figure 2 Proposed Finite State Transducer model
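As an illustration of the six components above, a toy transducer can be written down directly; the states, alphabets, and transitions below are hypothetical and far simpler than the model in Figure 2:

```python
# A toy finite state transducer built from the six components listed
# above. The alphabets and transitions are illustrative only, showing
# how a sequence of hadith header symbols could be mapped to entity
# labels; the actual model in Figure 2 is more elaborate.

class FST:
    def __init__(self, states, input_alpha, output_alpha, initial, finals, delta):
        self.states, self.I, self.F, self.delta = states, initial, finals, delta
        self.input_alpha, self.output_alpha = input_alpha, output_alpha

    def run(self, symbols):
        """Transduce an input sequence; return None unless we end in a final state."""
        state, out = self.I, []
        for s in symbols:
            state, o = self.delta[(state, s)]
            out.append(o)
        return out if state in self.F else None

# transition relation Δ: (state, input symbol) -> (next state, output symbol)
delta = {
    ("q0", "NUM"):  ("q1", "BookNo"),
    ("q1", "WORD"): ("q2", "BookName"),
    ("q2", "NUM"):  ("q3", "HadithNo"),
}
fst = FST({"q0", "q1", "q2", "q3"}, {"NUM", "WORD"},
          {"BookNo", "BookName", "HadithNo"}, "q0", {"q3"}, delta)
print(fst.run(["NUM", "WORD", "NUM"]))  # ['BookNo', 'BookName', 'HadithNo']
```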
Using FSTs, we developed an algorithm to extract all the entities from the Hadith dataset and saved each entity separately into the database. The model shown in Figure 1 has been used to extract the Book Number, Book Name, Chapter Number, Chapter Name, Hadith Number, Hadith Sanad, and Matn from the Hadith structure. An example showing all the components of a Hadith, both in the actual format present in the book and in the extracted pattern form, is given in Figure 3. One thing to note is that, as we have only processed the Hadith in the Urdu language, we skipped the Arabic parts, i.e., the Arabic Book Name, Chapter Name, and Hadith text, from the extraction model and extracted only the Urdu part of each Hadith.
Figure 3 Information Extraction Model of Hadith
C- Part of Speech (POS) Tagging
After we successfully extracted all the entities from each hadith, the next step was to tag the data on the basis of POS. For POS tag assignment we used a CRF [29] model. For CRF evaluation we made use of the Urdu Digest POS Tagged dataset released by the Center for Language Engineering (CLE dataset²) for research and computational processing in Urdu. The CLE dataset contains 100K Urdu words from various domains.
Our approach applied CRF for part of speech tag labeling and prediction. In particular, we used the C#-based open-source library CRFSharp³. CRFSharp is a new CRF package, and we believe that it is more flexible than the current advanced systems.
To test the performance of the CRF model, we conducted tenfold cross validation experiments on the CLE dataset; the CRF achieved very good results, as shown in Table 1.
Table 1 Experimental Result of POS Tagging
Type Result
Precision 96.44%
Recall 88.77%
F-Measure 92.41%
² http://www.cle.org.pk/ accessed on 01-07-2017
³ https://github.com/zhongkaifu/CRFSharp accessed on 01-07-2017
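For reference, the F-measure in Table 1 is the harmonic mean of precision and recall; a quick check (which may differ slightly from the reported 92.41% if the figure was averaged per fold) looks like:

```python
# F-measure as the harmonic mean of precision and recall. Averaging
# per-fold scores in cross validation can shift the combined figure
# slightly, which may explain small differences from Table 1.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(96.44, 88.77), 2))  # 92.45
```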
The features used in our 10-fold cross validation are as follows:
1. Previous lexical word
2. Current lexical word
3. Next lexical word
4. Current lexical word + Previous lexical word
5. Current lexical word + Next lexical word
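The five feature templates above can be sketched as a simple feature-extraction function; the padding symbols and sample tokens are illustrative assumptions:

```python
# A sketch of the five feature templates listed above, applied to one
# position in a token sequence. The "<s>"/"</s>" boundary padding and
# the sample tokens are hypothetical.
def features(tokens, i):
    prev_w = tokens[i - 1] if i > 0 else "<s>"
    next_w = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
    curr = tokens[i]
    return {
        "prev": prev_w,                      # 1. previous lexical word
        "curr": curr,                        # 2. current lexical word
        "next": next_w,                      # 3. next lexical word
        "curr+prev": curr + "|" + prev_w,    # 4. current + previous
        "curr+next": curr + "|" + next_w,    # 5. current + next
    }

toks = ["woh", "ghar", "gaya"]
print(features(toks, 1))
```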
After checking the CRF model's performance, we trained the CRF model again on the whole CLE dataset and supplied our data to it as test data. A graphical representation of our tagging work is shown in Figure 4.
Figure 4 Graphical Representation of POS Tagging
An example of our tagged Hadith data, using the hadith shown in Figure 3, is given in Figure 5.
Figure 5 Hadith Sanad And Matn Tagged Example
D- Named Entity Extraction and Multiword Detection
After POS tagging was applied to the data, the next step was to extract the nouns from the tagged Sanad and Matn documents, as shown in Figure 1. For this purpose, we extract the noun information from the Sanad and Matn data. This noun data will be processed further in the future to extract named entities and to perform multi-word detection. In named entity extraction, we will classify the extracted named entities into different classes, as given at the bottom of Figure 1.
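A minimal sketch of the noun-filtering step, assuming a CLE-style "NN" noun tag; the (token, tag) pairs are hypothetical:

```python
# A sketch of extracting noun tokens from POS-tagged Sanad/Matn data.
# The (token, tag) pairs and the "NN" noun label are illustrative;
# the actual CLE tagset labels may differ.
tagged = [("Abdullah", "NN"), ("ne", "PP"), ("kaha", "VB"), ("Madina", "NN")]
nouns = [tok for tok, tag in tagged if tag.startswith("NN")]
print(nouns)  # ['Abdullah', 'Madina']
```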
E- Proposed Information Retrieval System as Future Plan
For information retrieval against user queries, the Vector Space Model (VSM) will be used. The VSM is an algebraic model used to represent text documents as vectors. It is one of the major models used in information retrieval and is also called the term frequency-inverse document frequency (TF-IDF) model. Our proposed IR system is shown in Figure 6.
Figure 6 Information Retrieval System
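The TF-IDF weighting behind the VSM can be sketched as follows; the toy documents and query are placeholders, and the actual system would operate on tokenized hadith text:

```python
import math
from collections import Counter

# A minimal TF-IDF vector space sketch for ranking documents against
# a query. The documents and query terms are toy placeholders for the
# hadith collection the proposed IR system would index.
docs = [["prayer", "fasting", "charity"],
        ["prayer", "pilgrimage"],
        ["charity", "charity", "fasting"]]

def tfidf(doc, term, docs):
    """TF-IDF weight of one term in one document."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

def score(query, doc, docs):
    # sum of TF-IDF weights of query terms in the document
    return sum(tfidf(doc, t, docs) for t in query)

query = ["charity", "fasting"]
ranked = sorted(range(len(docs)), key=lambda i: score(query, docs[i], docs),
                reverse=True)
print(ranked[0])  # index of the best-matching document
```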
V- CONCLUSION
In this paper, our ultimate goal is to create an Information Retrieval system based on the Hadith data of Sahih Bukhari. For this purpose, we have used different algorithms and processes to achieve our target. Our work includes applying an FST to the Hadith data to extract entities, using a CRF-based tagger to tag the Sanad and Matn parts of each Hadith, and, on the tagged data, applying NER and classifying the named entities into different classes.
Our future work consists of extracting named entities from the tagged documents, performing multi-word detection on those extracted named entities, and applying the Vector Space Model (VSM) to the data to build an Information Retrieval system that will let users query against our model. We will also work further on our FST and POS tagging models to improve our processing results.
REFERENCES
[1] R. Khan, H. U. Khan, M. S. Faisal, K. Iqbal, and M.
S. I. Malik, “An Analysis of Twitter users of
Pakistan,” Int. J. Comput. Sci. Inf. Secur., vol. 14, no.
8, p. 855, 2016.
[2] E. Cambria and B. White, “Jumping NLP Curves: A
Review of Natural Language Processing Research
[Review Article],” IEEE Comput. Intell. Mag., vol. 9,
no. 2, pp. 48–57, May 2014.
[3] C. M. S. Faisal, A. Daud, F. Imran, and S. Rho, “A
novel framework for social web forums’ thread
ranking based on semantics and post quality features,”
J. Supercomput., vol. 72, no. 11, pp. 4276–4295, Nov.
2016.
[4] G. Luo, X. Huang, C.-Y. Lin, and Z. Nie, “Joint
named entity recognition and disambiguation,” in
Proc. EMNLP, 2015, pp. 879–880.
[5] G. Lample, M. Ballesteros, S. Subramanian, K.
Kawakami, and C. Dyer, “Neural Architectures for
Named Entity Recognition,” ArXiv160301360 Cs,
Mar. 2016.
[6] M. Oudah and K. Shaalan, “NERA 2.0: Improving
coverage and performance of rule-based named entity
recognition for Arabic,” Nat. Lang. Eng., vol. 23, no.
3, pp. 441–472, May 2017.
[7] C. N. dos Santos and V. Guimarães, “Boosting
Named Entity Recognition with Neural Character
Embeddings,” ArXiv150505008 Cs, May 2015.
[8] A. Daud, W. Khan, and D. Che, “Urdu language
processing: a survey,” Artif. Intell. Rev., vol. 47, no.
3, pp. 279–311, Mar. 2017.
[9] A. Maletti, “Survey: Finite-state technology in natural
language processing,” Theor. Comput. Sci., vol. 679,
pp. 2–17, May 2017.
[10] M. Padró and L. Padró, “A named entity recognition
system based on a finite automata acquisition
algorithm,” Proces. Leng. Nat., vol. 35, pp. 319–326,
2005.
[11] Y. Benajiba, P. Rosso, and J. M. BenedíRuiz,
“ANERsys: An Arabic Named Entity Recognition
System Based on Maximum Entropy,” in
Computational Linguistics and Intelligent Text
Processing, 2007, pp. 143–153.
[12] M. A. Meselhi, H. M. A. Bakr, I. Ziedan, and K.
Shaalan, “A Novel Hybrid Approach to Arabic
Named Entity Recognition,” in Machine Translation,
vol. 493, X. Shi and Y. Chen, Eds. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2014, pp.
93–103.
[13] K. Shaalan, “A Survey of Arabic Named Entity
Recognition and Classification,” Comput. Linguist.,
vol. 40, no. 2, pp. 469–510, Jun. 2014.
[14] G. Vishal and S. L. Gurpreet, “Named Entity
Recognition for Punjabi Language Text
Summarization,” Int. J. Comput. Appl., vol. 33, no. 3,
pp. 28–32, Nov. 2011.
[15] S. Morwal, N. Jahan, and D. Chopra, “Named Entity
Recognition using Hidden Markov Model (HMM),”
ResearchGate, vol. 1, no. 4, pp. 15–23, Dec. 2012.
[16] A. Dey, A. Paul, and B. Purkayastha, “Named Entity
Recognition for Nepali language: A Semi Hybrid
Approach,” Int. J. Eng. Innov. Technol., vol. 3, 2014.
[17] W. Anwar, X. Wang, and X. l Wang, “A Survey of
Automatic Urdu Language Processing,” in 2006
International Conference on Machine Learning and
Cybernetics, 2006, pp. 4489–4494.
[18] U. Singh, V. Goyal, and G. S. Lehal, “Named Entity
Recognition System for Urdu.,” in COLING, 2012,
pp. 2507–2518.
[19] A. Ekbal and S. Bandyopadhyay, “Named entity
recognition in Bengali using system combination,”
Lingvisticae Investig., vol. 37, no. 1, pp. 1–22, Jan.
2014.
[20] S. Mukund and R. K. Srihari, “NE Tagging for Urdu
Based on Bootstrap POS Learning,” in Proceedings of
the Third International Workshop on Cross Lingual
Information Access: Addressing the Information Need
of Multilingual Societies, Stroudsburg, PA, USA,
2009, pp. 61–69.
[21] K. Riaz, “Rule-based Named Entity Recognition in
Urdu,” in Proceedings of the 2010 Named Entities
Workshop, Stroudsburg, PA, USA, 2010, pp. 126–
135.
[22] F. Harrag, E. El-Qawasmeh, and A. M. S. Al-Salman,
“Extracting Named Entities from Prophetic Narration
Texts (Hadith),” in Software Engineering and
Computer Systems, 2011, pp. 289–297.
[23] M. A. Siddiqui, M. E. Saleh, and A. A. Bagais, “Extraction and Visualization of the Chain of Narrators from Hadiths using Named Entity Recognition and Classification,” ResearchGate, vol. 5, no. 1, Apr. 2014.
[24] F. Harrag, “Text mining approach for knowledge
extraction in Sahîh Al-Bukhari,” Comput. Hum.
Behav., vol. 30, pp. 558–566, Jan. 2014.
[25] N. K. Ibrahim, M. F. Noordin, S. Samsuri, M. S. A.
Seman, and A. E. B. Ali, “Isnad Al-Hadith
Computational Authentication: An Analysis
Hierarchically,” in 2016 6th International Conference
on Information and Communication Technology for
The Muslim World (ICT4M), 2016, pp. 344–348.
[26] M. A. Saloot, N. Idris, R. Mahmud, S. Ja’afar, D.
Thorleuchter, and A. Gani, “Hadith data mining and
classification: a comparative analysis,” Artif. Intell.
Rev., vol. 46, no. 1, pp. 113–128, Jun. 2016.
[27] H. U. Khan, S. M. Saqlain, M. Shoaib, and M. Sher,
“Ontology Based Semantic Search in Holy Quran,”
Int. J. Future Comput. Commun., vol. 2, no. 6, pp.
570–575, 2013.
[28] M. Shoaib, M. N. Yasin, U. K. Hikmat, M. I. Saeed,
and M. S. H. Khiyal, “Relational WordNet model for
semantic search in Holy Quran,” in 2009
International Conference on Emerging Technologies,
2009, pp. 29–34.
[29] J. Lafferty, A. McCallum, and F. Pereira,
“Conditional Random Fields: Probabilistic Models for
Segmenting and Labeling Sequence Data,” Dep. Pap.
CIS, Jun. 2001.