Query based information retrieval and knowledge
extraction using Hadith datasets
Ahsan Mahmood¹, Hikmat Ullah Khan², Zahoor-ur-Rehman¹, Wahab Khan³
¹ Computer Science Dept., COMSATS Institute of Information Technology, Attock, Pakistan
² Computer Science Dept., COMSATS Institute of Information Technology, Wah, Pakistan
³ Department of Computer Science and Software Engineering, International Islamic University, Islamabad 44000, Pakistan
Ahsan_mahmood_awan@yahoo.com, Hikmat.ullah@ciitwah.edu.pk, xahoor@ciit-attock.edu.pk, wahab.phdcs72@iiu.edu.pk
Abstract- Named Entity Recognition (NER), the task of identifying names of people, locations, and other entities, is one of the fundamental tasks in Natural Language Processing (NLP). Applications of NER include chatbots, speech recognition, machine translation, knowledge extraction, and intelligent search systems, and it has been an active research domain for the last ten years. In this paper, we propose a knowledge extraction framework to extract named entities from the Urdu translation of Sahih Al-Bukhari, a world-renowned Hadith book. The proposed framework uses a finite state transducer to extract entities and processes the Hadith content using Part of Speech (POS) tagging. A Conditional Random Field (CRF), a probabilistic sequence labeling model, processes the extracted nouns for NER and classification. In the future, we aim to extend the framework to rank the Hadith content and apply the Vector Space Model.
Index Terms---Data Extraction, Hadith Information Retrieval,
Framework, Named Entity Recognition.
I- INTRODUCTION
In the current era of information technology, a major part of the data is in textual form, and this data is growing at a very rapid rate as web users shift towards social media, where they share new textual content on a daily basis [1]. This data is available in multiple formats and places, including news, social media, blogs, Wikipedia, and articles, and in multiple languages. Much of this huge increase in textual data on the World Wide Web came after the birth of social media, where millions of people share textual content in different languages [2], [3]. In human-computer interaction, understanding natural language is one of the major tasks, achieved through the help of NLP [4]. To completely understand and generate natural language, it is first necessary to perform basic NLP, such as the syntactic analysis of language, which includes understanding language morphology, part-of-speech tagging, parsing, and stemming, and then to perform the complex parts of NLP, such as Named Entity Recognition (NER), lexical semantics, sentiment analysis, machine translation, natural language understanding, and natural language generation. NER is one of the challenging tasks in NLP due to the unavailability or limitation of supervised training data [5]. Another factor that makes NER challenging is morphological richness [6]. NER involves the extraction of information such as names of persons, locations, and organizations. NER is essential in NLP, especially when we want to extract the underlying knowledge from textual data, as most of the important entities are named entities (NEs). To achieve this task, NER relies on other NLP tasks such as stemming, POS tagging, and morphological analysis [7]. These tasks are applied over the textual data one by one, in the form of a framework, to extract the underlying information that leads to knowledge extraction.
In the process of knowledge discovery, Islamic knowledge sources such as the Quran and Hadith have high significance, as a huge amount of knowledge is present in these two sources. We need to understand the need for extracting Islamic knowledge, the role of Artificial Intelligence and Data Mining (DM), and how diverse methods can be applied to extract relevant knowledge from Islamic data. Our goal is to make the underlying knowledge in Islamic sources accessible, enabling users to perform intelligent searches and relational queries over the knowledge system. As the data all around the world is increasing at a very rapid rate, we must extract the knowledge of Islamic resources to distinguish between true and fabricated knowledge.
In this paper, we approach NLP through the Urdu translation of Sahih Bukhari, and our ultimate objective is to build an Information Retrieval (IR) system in which users can issue queries and get the best results. Urdu is a morphologically rich language, similar to Arabic, with its roots in Arabic and Persian. Due to the lack of capitalization and its morphological complexity, NLP in Urdu is challenging [8]. We start with a model based on a Finite State Transducer (FST) to extract the entities of each hadith; from the extracted entities, we process the textual part (Sanad and Matn) further, tokenizing the data and removing stop words. The tokenized words are then used for part-of-speech tagging with the help of Conditional Random Fields (CRF). In the tagged data, nouns are processed further, since most of the important content resides in nouns, and are used to extract NEs from the tagged dataset. After the extraction of NEs, we will detect multi-words and divide the extracted NEs into different classified groups.
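The tokenization and stop-word removal step described above can be sketched as follows. This is an illustrative sketch only: the stop-word list below is a small sample of common Urdu function words, not the list used in this work, and whitespace splitting stands in for a full Urdu tokenizer.

```python
# Minimal sketch of tokenization and stop-word removal.
# The stop-word set is a small illustrative sample of common Urdu
# function words, not the actual list used in the pipeline.
URDU_STOP_WORDS = {"اور", "کا", "کی", "کے", "سے", "میں", "نے", "کو", "ہے"}

def tokenize(text):
    """Split on whitespace; a real Urdu tokenizer must also handle
    punctuation and space-omission/space-insertion issues."""
    return text.split()

def remove_stop_words(tokens, stop_words=URDU_STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]
```

In practice, the surviving content words would then be passed on to the POS tagging stage.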
The rest of the paper is organized as follows: Section II presents the background of this work, Section III covers related work, and Section IV describes the proposed methodology.
II- BACKGROUND
In our existing work, we developed a Hadith dataset repository, prepared by extracting data from different hadith books in different languages, including English, Urdu, and Arabic. The repository is freely available for research purposes. In that earlier work, we selected different authenticated sources of hadith datasets from different hadith books, including Sahih Muslim, Sahih Al-Bukhari, Sunan Abu Dawood, Muwatta Imam Malik, etc. Data was extracted by crawling different websites and collecting the required content. In each book, the hadiths are organized in different formats, including chapters, sections, and books. Because the data was divided into different hierarchies, we had to use a different algorithm for each book in order to extract all of its components. The algorithms were used along with regular expressions to parse each part of a hadith separately into the database.
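The regular-expression parsing described above can be sketched as follows. The line format here is hypothetical, since each source book used its own layout and required its own patterns; the sketch only illustrates the general approach of capturing hadith components into database fields.

```python
import re

# Illustrative regex-based parsing of one crawled line into hadith
# components. The "Book N, Chapter N, Hadith N: text" layout is a
# hypothetical example, not the format of any specific source site.
HADITH_LINE = re.compile(
    r"^Book\s+(\d+),\s*Chapter\s+(\d+),\s*Hadith\s+(\d+):\s*(.+)$"
)

def parse_line(line):
    """Return a field dictionary for a matching line, else None."""
    m = HADITH_LINE.match(line)
    if not m:
        return None
    book, chapter, number, text = m.groups()
    return {"book": int(book), "chapter": int(chapter),
            "hadith": int(number), "text": text}
```

Each book would get its own pattern (or set of patterns) in this style, with the captured groups written to the corresponding database columns.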
III- RELATED WORK
Multiple systems for text mining have been created using different techniques. NLP is one of the major fields of application of automata theory [9]. Through NLP we can retrieve information from textual data, and in information retrieval systems one of the important tasks is NER. NER consists of identifying lexical units among words; it involves the detection of concrete entities and the type or class to which each entity refers. In most NLP systems, this has been an important piece of information [10]. In the field of Named Entity Extraction (NEE), the Message Understanding Conference series has been one of the major forums where researchers compare their NER work with each other. A problem with NEE systems is that most of the work done in this field is specific to English or other left-to-right languages, with comparatively little for South Asian languages such as Hindi and Urdu, or for Arabic. Some work has been done for the Arabic language, such as the ANERsys system, built for NER over Arabic textual data [11]. Meselhi et al. proposed a hybrid approach for NER in the Arabic language that incorporates a rule-based component [12]. Shaalan presented a survey paper discussing the different kinds of work done in Arabic NER and the techniques used for this purpose [13].
Over the last decade, researchers have been working on NER for South Asian languages using different techniques and rule-based systems. A system for NER in the Punjabi language has been developed using various types of rules; it achieves accuracy levels from 82% to 89% [14]. Another, language-independent NER system based on Hidden Markov Models was developed for several South Asian languages [15]. A further system was built for NER in the Nepali language using a semi-hybrid, rule-based approach [16].
In recent years, South Asian languages such as Hindi and Urdu, along with Arabic, have been under focus by researchers. Urdu is one of the major languages of South Asia; its script is written right to left, like Arabic and Persian [17]. The International Joint Conference on Natural Language Processing 2008 hosted an important workshop in which NER systems for Indian languages and Urdu were developed; these systems used statistical or hybrid techniques to achieve NER [18]. In Urdu NLP, different types of techniques have been used to achieve NEE. A model was built on the basis of Conditional Random Fields (CRF) to achieve NER from Urdu language text [19]. Later, Mukund and Srihari proposed a four-stage CRF-based model for Urdu that achieved an F-score of 68.9% [20]. Text processing applications need to recognize entities such as names, numbers, organizations, and dates, and recognizing these entities means a vital goal has been achieved in this field [21]. According to Daud, NER is harder in Urdu due to the language's lack of capitalization features [8].
Hadith data has been used by many researchers for NLP and NER. Harrag et al. [22] performed named entity extraction from Hadith data; their work includes extracting lexical units from words and other kinds of concrete entities. Another system has been presented for the extraction of narrators from a Hadith dataset through NER methods [23]. Further work has been presented on the extraction of surface information from hadith: the authors detected the entities and words that carry knowledge in Hadith texts, and their system was built on finite state transducers [24]. Ibrahim et al. [25] presented work on hadith data in which they performed an analysis of the Isnad from hadith text and used the Isnad data for the analysis of Hadith. Saloot et al. [26] presented a paper comparing different kinds of NLP tasks applied to hadith data in different languages. Our previous work also includes work on Islamic sources such as the Holy Quran: we earlier proposed a system through which users can perform semantic search, based on an ontology [27], and another system for the same purpose based on a WordNet model [28].
IV- PROPOSED METHODOLOGY
The aim of this research is to propose an Information Retrieval system through which users can query and get results matching their query. The most important steps of this research include the extraction of entities, the creation of tokens, the removal of stop words, part-of-speech tagging of tokens, the use of tagged data for NER, multi-word detection from the NER output, and the creation of an IR system from the NER data. The proposed hierarchy of entities to be extracted is shown in Figure 1. We explain each part of the process in detail in the next sections.
Figure 1 Hierarchy of Entities for Detection as per Proposed Model
A- Corpus of Sahih Al-Bukhari
We obtained the complete Urdu translation of Sahih Bukhari from islamicurdubooks.com¹. The book is available in MS Word format and is divided into 7 separate files. Each file contains about 1000 hadiths; combined, the complete book contains 7563 hadiths. The book has three hierarchical levels: books, chapters within books, and hadiths within chapters. Each hadith has multiple parts, which are discussed in the next section.
B- Entity Extraction from Hadith
In order to process the book, each hadith had to be extracted along with its Book Name, Book Number, Chapter Name, Chapter Number, and Hadith Number, followed by the extraction of information from the Hadith Sanad and Hadith Matn. For this purpose, we used Finite State Transducers (FSTs), one of the most commonly used tools in natural language processing. An FST is essentially a sequential machine with multiple possible outputs, defined by six parts:
Σ denotes the input alphabet
Γ denotes the output alphabet
Q denotes the set of states
I denotes the set of initial states
F denotes the set of final states
Δ denotes the transition relation

¹ www.islamicurdubooks.com/download
An example of the proposed FST model is shown in Figure 2.
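The six-part definition above can be sketched as a small transducer class. The states, labels, and the toy transition table below are illustrative only, not the actual transducer built for the hadith extraction.

```python
# Minimal finite state transducer following the six-part definition:
# input alphabet and output alphabet are implicit in the transition
# table, which maps (state, input symbol) -> (next state, output).
class FST:
    def __init__(self, transitions, initial, finals):
        self.transitions = transitions  # (state, symbol) -> (state, output)
        self.initial = initial          # initial state
        self.finals = finals            # set of final (accepting) states

    def transduce(self, symbols):
        """Run the input sequence through the machine, collecting outputs."""
        state, outputs = self.initial, []
        for s in symbols:
            state, out = self.transitions[(state, s)]
            if out is not None:
                outputs.append(out)
        if state not in self.finals:
            raise ValueError("input rejected")
        return outputs

# Toy transducer: tag a hadith-number field label and copy its value.
fst = FST(
    transitions={
        ("q0", "Hadith No"): ("q1", "NUMBER"),
        ("q1", "123"): ("q2", "123"),
    },
    initial="q0",
    finals={"q2"},
)
```

A real extraction transducer would have one branch per field (book name, chapter number, Sanad, Matn, etc.), each emitting the field tag followed by the captured value.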
Figure 2 Proposed Finite State Transducer model
Using FSTs, we developed an algorithm to extract all the entities from the Hadith dataset and saved each entity separately into the database. The model shown in Figure 1 has been used to extract the Book Number, Book Name, Chapter Number, Chapter Name, Hadith Number, Hadith Sanad, and Matn from the hadith structure. An example showing all the components of a hadith in the actual format present in the book, together with the extracted pattern model, is shown in Figure 3. Note that since we have only processed the hadith in Urdu, we skipped the Arabic parts, i.e., the Arabic book name, chapter name, and hadith text, from the extraction model and extracted only the Urdu part of each hadith.
Figure 3 Information Extraction Model of Hadith
C- Part of Speech (POS) Tagging
After successfully extracting all the entities from each hadith, the next step was to tag the data on the basis of POS. For POS tag assignment we used a CRF [29] model. For CRF evaluation we used the Urdu Digest POS tagged dataset released by the Center for Language Engineering (CLE)² for research and computational processing in Urdu. The CLE dataset contains 100K Urdu words from various domains.
Our approach applied CRF for part-of-speech tag labeling and prediction. In particular, we used the C#-based open source library CRFSharp³, a relatively new CRF package that we believe is more flexible than the currently available alternatives.
For CRF model performance testing, we conducted tenfold cross-validation experiments on the CLE dataset; the CRF achieved very good results, as shown in Table 1.
Table 1 Experimental Results of POS Tagging
Type        Result
Precision   96.44%
Recall      88.77%
F-Measure   92.41%
² http://www.cle.org.pk/ accessed on 01-07-2017
³ https://github.com/zhongkaifu/CRFSharp accessed on 01-07-2017
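As a sanity check on Table 1, the F-measure is the harmonic mean of precision and recall. With P = 96.44 and R = 88.77 this gives roughly 92.45; the 92.41 reported in the table likely reflects fold-wise averaging across the ten folds rather than a single aggregate computation.

```python
# F-measure as the harmonic mean of precision and recall,
# applied to the aggregate figures from Table 1.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

f1 = f_measure(96.44, 88.77)  # ~92.45 from the aggregate P and R
```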
The features used in our 10-fold cross-validation are as follows:
1. Previous lexical word
2. Current lexical word
3. Next lexical word
4. Current lexical word + Previous lexical word
5. Current lexical word + Next lexical word
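The five contextual features above can be sketched as token-wise feature dictionaries of the kind consumed by typical CRF toolkits. CRFSharp itself is configured via template files, so this Python function is an equivalent illustration rather than its actual input format; the `<BOS>`/`<EOS>` boundary markers are an assumption.

```python
# Sketch of the five contextual features listed above, one feature
# dictionary per token position. <BOS>/<EOS> are assumed boundary
# markers for the sentence edges.
def token_features(tokens, i):
    prev_w = tokens[i - 1] if i > 0 else "<BOS>"
    next_w = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    cur_w = tokens[i]
    return {
        "prev": prev_w,                    # 1. previous lexical word
        "cur": cur_w,                      # 2. current lexical word
        "next": next_w,                    # 3. next lexical word
        "cur+prev": cur_w + "|" + prev_w,  # 4. current + previous word
        "cur+next": cur_w + "|" + next_w,  # 5. current + next word
    }
```

The conjunction features (4 and 5) let the model learn word-pair patterns that the individual word features cannot capture on their own.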
After validating the CRF model's performance, we retrained the model on the whole CLE dataset and supplied our Hadith data as test data. A graphical representation of our tagging process is shown in Figure 4.
Figure 4 Graphical Representation of POS Tagging
An example of our tagged Hadith data is shown in Figure 5, using the hadith shown in Figure 3.
Figure 5 Hadith Sanad And Matn Tagged Example
D- Named Entity Extraction and Multiword Detection
After POS tagging was applied to the data, the next step was to extract the nouns from the tagged Sanad and Matn documents, as shown in Figure 1. For this purpose, we extract the noun information from the Sanad and Matn data. These nouns will be processed further in the future to extract named entities and to perform multi-word detection. In named entity extraction, we will classify the extracted named entities into the different classes given at the bottom of Figure 1.
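The noun-filtering step can be sketched as follows. The convention that noun tags start with "NN" is an assumption borrowed from common POS tagsets; the CLE tagset's labels may differ in detail.

```python
# Sketch of the noun-extraction step: keep tokens whose POS tag marks
# a noun. Tags starting with "NN" are assumed to be noun tags, which
# is a common convention but may not match the CLE tagset exactly.
def extract_nouns(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs from the CRF tagger."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]
```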
E- Proposed Information Retrieval System as Future Plan
For information retrieval against user queries, the Vector Space Model (VSM) will be used. The VSM is an algebraic model used to represent text documents as vectors, and it is one of the major models used in information retrieval. It is also known as the term frequency-inverse document frequency (TF-IDF) model. Our proposed IR system is shown in Figure 6.
Figure 6 Information Retrieval System
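The planned VSM retrieval can be sketched in a few lines: documents and the query are mapped to TF-IDF weight vectors, and documents are ranked by cosine similarity to the query. This is a generic illustration of the model, not the paper's implementation.

```python
import math
from collections import Counter

# Minimal TF-IDF vector space model: weight = tf * log(N / df),
# documents ranked by cosine similarity to the query vector.
def tfidf_vectors(docs):
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vecs, idf

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs):
    """Return document indices sorted by decreasing query similarity."""
    vecs, idf = tfidf_vectors(docs)
    qv = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
    scores = [(cosine(qv, dv), i) for i, dv in enumerate(vecs)]
    return [i for _, i in sorted(scores, reverse=True)]
```

In the proposed system, each hadith (its Sanad and Matn tokens) would be one document, and the user's query terms would form the query vector.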
Conclusion
In this paper, our ultimate goal is to create an Information Retrieval system based on the Hadith data of Sahih Bukhari. For this purpose, we used different algorithms and processes: we applied an FST to the Hadith data to extract entities, used a CRF-based tagger to tag the Sanad and Matn parts of each hadith, and applied NER to the tagged data to classify the named entities into different classes.
Our future work consists of extracting named entities from the tagged documents, performing multi-word detection over the extracted named entities, and applying the Vector Space Model (VSM) to the data to build an Information Retrieval system that lets users query against our model. We will also continue to refine our FST and POS tagging models to further improve our processing results.
REFERENCES
[1] R. Khan, H. U. Khan, M. S. Faisal, K. Iqbal, and M.
S. I. Malik, “An Analysis of Twitter users of
Pakistan,” Int. J. Comput. Sci. Inf. Secur., vol. 14, no.
8, p. 855, 2016.
[2] E. Cambria and B. White, “Jumping NLP Curves: A
Review of Natural Language Processing Research
[Review Article],” IEEE Comput. Intell. Mag., vol. 9,
no. 2, pp. 48–57, May 2014.
[3] C. M. S. Faisal, A. Daud, F. Imran, and S. Rho, “A
novel framework for social web forums’ thread
ranking based on semantics and post quality features,”
J. Supercomput., vol. 72, no. 11, pp. 4276–4295, Nov.
2016.
[4] G. Luo, X. Huang, C.-Y. Lin, and Z. Nie, “Joint
named entity recognition and disambiguation,” in
Proc. EMNLP, 2015, pp. 879–880.
[5] G. Lample, M. Ballesteros, S. Subramanian, K.
Kawakami, and C. Dyer, “Neural Architectures for
Named Entity Recognition,” ArXiv160301360 Cs,
Mar. 2016.
[6] M. Oudah and K. Shaalan, “NERA 2.0: Improving
coverage and performance of rule-based named entity
recognition for Arabic,” Nat. Lang. Eng., vol. 23, no.
3, pp. 441–472, May 2017.
[7] C. N. dos Santos and V. Guimarães, “Boosting
Named Entity Recognition with Neural Character
Embeddings,” ArXiv150505008 Cs, May 2015.
[8] A. Daud, W. Khan, and D. Che, “Urdu language
processing: a survey,” Artif. Intell. Rev., vol. 47, no.
3, pp. 279–311, Mar. 2017.
[9] A. Maletti, “Survey: Finite-state technology in natural
language processing,” Theor. Comput. Sci., vol. 679,
pp. 2–17, May 2017.
[10] M. Padró and L. Padró, “A named entity recognition
system based on a finite automata acquisition
algorithm,” Proces. Leng. Nat., vol. 35, pp. 319–326,
2005.
[11] Y. Benajiba, P. Rosso, and J. M. BenedíRuiz,
“ANERsys: An Arabic Named Entity Recognition
System Based on Maximum Entropy,” in
Computational Linguistics and Intelligent Text
Processing, 2007, pp. 143–153.
[12] M. A. Meselhi, H. M. A. Bakr, I. Ziedan, and K.
Shaalan, “A Novel Hybrid Approach to Arabic
Named Entity Recognition,” in Machine Translation,
vol. 493, X. Shi and Y. Chen, Eds. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2014, pp.
93–103.
[13] K. Shaalan, “A Survey of Arabic Named Entity
Recognition and Classification,” Comput. Linguist.,
vol. 40, no. 2, pp. 469–510, Jun. 2014.
[14] G. Vishal and S. L. Gurpreet, “Named Entity
Recognition for Punjabi Language Text
Summarization,” Int. J. Comput. Appl., vol. 33, no. 3,
pp. 28–32, Nov. 2011.
[15] S. Morwal, N. Jahan, and D. Chopra, “Named Entity
Recognition using Hidden Markov Model (HMM),”
ResearchGate, vol. 1, no. 4, pp. 15–23, Dec. 2012.
[16] A. Dey, A. Paul, and B. Purkayastha, “Named Entity
Recognition for Nepali language: A Semi Hybrid
Approach,” Int. J. Eng. Innov. Technol., vol. 3, 2014.
[17] W. Anwar, X. Wang, and X. l Wang, “A Survey of
Automatic Urdu Language Processing,” in 2006
International Conference on Machine Learning and
Cybernetics, 2006, pp. 4489–4494.
[18] U. Singh, V. Goyal, and G. S. Lehal, “Named Entity
Recognition System for Urdu.,” in COLING, 2012,
pp. 2507–2518.
[19] A. Ekbal and S. Bandyopadhyay, “Named entity
recognition in Bengali using system combination,”
Lingvisticae Investig., vol. 37, no. 1, pp. 1–22, Jan.
2014.
[20] S. Mukund and R. K. Srihari, “NE Tagging for Urdu
Based on Bootstrap POS Learning,” in Proceedings of
the Third International Workshop on Cross Lingual
Information Access: Addressing the Information Need
of Multilingual Societies, Stroudsburg, PA, USA,
2009, pp. 61–69.
[21] K. Riaz, “Rule-based Named Entity Recognition in
Urdu,” in Proceedings of the 2010 Named Entities
Workshop, Stroudsburg, PA, USA, 2010, pp. 126–
135.
[22] F. Harrag, E. El-Qawasmeh, and A. M. S. Al-Salman,
“Extracting Named Entities from Prophetic Narration
Texts (Hadith),” in Software Engineering and
Computer Systems, 2011, pp. 289–297.
[23] M. A. Siddiqui, M. E. Saleh, and A. A. Bagais, “Extraction and Visualization of the Chain of Narrators from Hadiths using Named Entity Recognition and Classification,” ResearchGate, vol. 5, no. 1, Apr. 2014.
[24] F. Harrag, “Text mining approach for knowledge
extraction in Sahîh Al-Bukhari,” Comput. Hum.
Behav., vol. 30, pp. 558–566, Jan. 2014.
[25] N. K. Ibrahim, M. F. Noordin, S. Samsuri, M. S. A.
Seman, and A. E. B. Ali, “Isnad Al-Hadith
Computational Authentication: An Analysis
Hierarchically,” in 2016 6th International Conference
on Information and Communication Technology for
The Muslim World (ICT4M), 2016, pp. 344–348.
[26] M. A. Saloot, N. Idris, R. Mahmud, S. Ja’afar, D.
Thorleuchter, and A. Gani, “Hadith data mining and
classification: a comparative analysis,” Artif. Intell.
Rev., vol. 46, no. 1, pp. 113–128, Jun. 2016.
[27] H. U. Khan, S. M. Saqlain, M. Shoaib, and M. Sher,
“Ontology Based Semantic Search in Holy Quran,”
Int. J. Future Comput. Commun., vol. 2, no. 6, pp.
570–575, 2013.
[28] M. Shoaib, M. N. Yasin, U. K. Hikmat, M. I. Saeed,
and M. S. H. Khiyal, “Relational WordNet model for
semantic search in Holy Quran,” in 2009
International Conference on Emerging Technologies,
2009, pp. 29–34.
[29] J. Lafferty, A. McCallum, and F. Pereira,
“Conditional Random Fields: Probabilistic Models for
Segmenting and Labeling Sequence Data,” Dep. Pap.
CIS, Jun. 2001.
... Named Entity Recognition (NER) is a task that involves the identification of lexical units and the detection of specific entities and their corresponding types or classes [13]. NER recognizes noun entities such as names, dates or times, locations, and medicine names [9]. ...
... [30] employed chunk grammar to extract the narrators' names in the sanad part. Meanwhile, [13] employed a Finite State Transducer (FST) to extract entities such as Book Number, Book Name, Chapter Number, Chapter Name, Hadith Number, Hadith Sanad, and Matan from hadith structures and used Conditional Random Fields (CRF), as well as probabilistic models for segmenting and labelling sequence data to tag the extracted entities. But most researchers have not publicly shared their corpora [18]. ...
... For collecting and investigating mysterious log data created through user interaction with web IR structures, annoying opportunities are offered by the web evolution [10]. For learning users' requirements and behaviours, such data offers a good source and enables the model of user-oriented web IR structures [11]. ...
Article
Full-text available
In the growing information retrieval (IR) world, selecting suitable keywords and generating queries is important for effective retrieval. Modern database applications need a sophisticated interface for automatically updating the connections between users and databases. Most database applications are intelligent; however, some may be complex to understand when generating queries for effective retrieval. Therefore, this paper develops a new Adaptive Naked Mole-Rate algorithm (ANMR) for an automatic query generation (AQG) based IR system. The query generation approach primarily generates the query with expanded keywords to enhance IR. Modified sealion optimization (MSO) is applied to select several features. The selected features are combined in the feature fusion process using the Fuzzy based Search and Rescue approach (FSR). Similarity matching is performed using the hybrid cosine and Jaccard similarity measures. At last, the ranking process is performed using the Dynamic Global Local Attention Network based on Capsules (DGLANC). The developed AQG-ANMR system improves the performance of the entire information retrieval system. Then, to analyze the performance, the proposed approach is implemented in the python platform and evaluated in terms of accuracy, recall, precision, and F1-Score by employing TREC-3 and CISI datasets. Besides, the performance of the proposed approach is compared with that of state-of-the-art approaches. Finally, the simulated results clearly showed that the proposed approach outperformed the state-of-the-art approaches better. The maximum accuracy attained by the proposed approach is 97.2% with AQG-ANMR and 94.69% without AQG-ANMR for the TREC 3 dataset. In the same way, using the CISI dataset, the proposed approach reached a maximum accuracy of 98.6% with AQG-ANMR and 95.8% without AQG-ANMR, respectively.
... It is worth noting that this work was conducted using a relatively small dataset consisting of only 235 hadiths. In their research, (Mahmood et al., 2017) put forth a knowledge extraction framework that uses a Finite State Transducer system to extract named entities from the Urdu translation of Sahih Bukhari. The framework leverages POS tagging and the Conditional Random Field (CRF) algorithm to process the Hadith content, extract nouns for NER, and classify named entities into different categories. ...
Conference Paper
Full-text available
This paper presents a comprehensive review of the recent Sanad-based studies in Hadith literature. The review is organized into four main categories: automated Hadith classification for authenticity and reliability assessment, building and analysing Hadith narrator networks, identifying and extracting key components from Hadith text (Sanad and names), and construction and development of Sanad datasets and ontologies. Our review examines various methods used in automated Hadith classification, including expert systems, data mining, and machine learning, discussing their limitations. The analysis of Hadith narrator networks using techniques like social network analysis and graph theory provides insights into authenticity and credibility. Natural Language Processing techniques like Hidden Markov Models and Word Sense Disambiguation are employed for extracting key components from Hadith text. The construction of Sanad datasets and ontologies is discussed as valuable resources for facilitating information retrieval and analysis. This review summarizes the key findings and limitations of the studies in each category. It highlights the potential of Sanad-based studies for future research in Hadith studies and emphasizes the need for further refinement and validation of the proposed methods. Keywords: Sanad, Transmission Chain, Hadith, Sanad Analysis, Hadith Ontology, Dataset
... The acquisition of a substantial number of authentic samples, their subsequent collection and labelling, is commonly acknowledged as a labor-intensive, costly, and error-prone process Moreover, existing datasets frequently experience a disparity in data distribution [13]. Cyber-Physical Systems (CPS) integrates physical devices with computing and communication technologies, creating environments where human activities can be monitored and analysed for various applications, such as smart environments, healthcare, and security systems [14,15]. The manuscript is arranged as the text organized as below: Section 2 analyzes literature review. ...
Article
Full-text available
The Information Retrieval system aims to discover relevant documents and display them as query responses. However, the ever-changing nature of user queries poses a substantial research problem in defining the necessary data to respond accurately. The Major intention for this study is for enhance the retrieval of relevant information in response to user queries. The aim to develop an advanced IR system that adapts to changing user requirements. By introducing WMO_DBN, we seek to improve the efficiency and accuracy of information retrieval, catering to both general and specific user searches. The proposed methodology comprises three important steps: pre-processing, feature choice, and categorization. Initially, unstructured data subject to pre-processing to transform it into a structured format. Subsequently, relevant features are selected to optimize the retrieval process. The final step involves the utilization of WMO_DBN, a novel deep learning model designed for information retrieval based on the query data. Additionally, similarity calculation is employed to improve the effectiveness for the network training model. The investigational evaluation for the suggested model was conducted, and its performance is measured regarding the metrics of recall, precision, accuracy, and F1 score, the present discourse concerns their significance within the academic realm. The results prove the superiority of WMO_DBN in retrieving relevant information compared to traditional approaches. This research introduces novel method for addressing the challenges in information retrieval with the integration of WMO_DBN. By applying pre-processing, feature selection, and a deep belief neural network, the proposed system achieves more accurate and efficient retrieval of relevant information. The study contributes to the advancement of information retrieval systems and emphasizes the importance of adapting to users' evolving search queries. 
The success of WMO_DBN in retrieving relevant information highlights its potential for enhancing information retrieval processes in various applications.
... PoS tagging aims to label each word in a given text with its PoS tag, e.g., noun, verb, adjective, adverb, etc. It parses input text to assist downstream tasks, including syntactic tasks, e.g., text chunking [11,12] and dependency parsing [13,14], as well as high-level NLP tasks, e.g., information retrieval [15] and sentiment analysis [16,17], and metaphor interpretation [18][19][20]. These three tasks are all commonly regarded as a sequence labeling problem. Figure 1 shows an example of the different task labels given an input sentence. ...
Preprint
Full-text available
Syntactic processing techniques are the foundation of Natural Language Processing (NLP), supporting many downstream NLP tasks. In this paper, we conduct pair-wise Multi-Task Learning (MTL) on syntactic tasks with different granularity, namely Sentence Boundary Detection (SBD), text chunking, and Part-of-Speech(PoS) tagging, so as to investigate the extent to which they complement each other. We propose a novel soft parameter sharing mechanism to share local and global dependency information that is learned from both target tasks. We also propose a Curriculum Learning (CL) mechanism to improve MTL with non-parallel labeled data. Using non-parallel labeled data in MTL is a common practice, whereas it has not received enough attention before. For example, our employed PoS tagging data do not have text chunking labels. When learning PoS tagging and text chunking together, the proposed CL mechanism aims to select complementary samples from the two tasks to update the parameters of the MTL model in the same training batch. Such a method yields better performance and learning stability. We conclude that the fine-grained tasks can provide complementary features to coarse-grained ones, while the most coarse-grained task, SBD, provides useful information for the most fine-grained one, PoS tagging. Additionally, the text chunking task achieves state-of-the-art performance when joint learning with PoS tagging. Our analytical experiments also show the effectiveness of the proposed soft parameter sharing and CL mechanisms.
Chapter
This work examines the application of natural language processing (NLP) methods to textual analysis for information retrieval. It opens with an overview of the significance and function of text processing in information retrieval, followed by text pre-processing techniques such as stop-word removal, stemming, and tokenization. In addition, various NLP methods, including sentiment analysis, component identification, and named entity recognition, are surveyed. The research then examines various text representation methods, including language models, TF-IDF, and bag-of-words. "Text analytics" is the act of analyzing disorganized text and turning it into valuable data for analysis, yielding a quantifiable figure that carries some essential information. Businesses use text analysis more and more frequently: it aids in the analysis of unstructured data, such as customer reviews, as well as in the discovery of patterns and the forecasting of trends. The tools offered for transforming text into useful data for analysis include systems, libraries, automated processing programs, data collection, and extraction-based tools, to name just a few. The fundamentals of textual data, various text mining approaches, and the most widely used text analysis tools are all covered in this research. Naive Bayes, deep learning, and support vector machines are examined for text classification. The research covers information retrieval systems, NLP methods, captioning, phrase-based systems, text clustering, text similarity metrics, and text pre-processing, with a focus on describing the various techniques and their use in information retrieval.
The proposed system analyzes parameters such as reading time, ease of reading, readability score, number of paragraphs, average words per paragraph, total sentences in the longest paragraph, average words per sentence, longest sentence, words in the longest sentence, frequency of the word "and", compulsive hedgers, intensifiers, and vague words. All of this processing is done using NLP. The paper's conclusion discusses the value of textual analysis for information retrieval as well as its potential going forward.
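The bag-of-words and TF-IDF representations named above can be sketched in a few lines (a minimal hand-rolled version over a hypothetical toy corpus, not data from the chapter):

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the customer review was positive",
    "the customer review was negative",
    "shipping was fast",
]

def tf_idf(docs):
    tokenized = [d.split() for d in docs]                     # naive tokenization
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for doc in tokenized for t in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)                                     # bag-of-words counts
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

w = tf_idf(docs)
# Terms appearing in every document (e.g. "was") get weight 0;
# rarer, more discriminative terms score higher.
```

Real systems typically use a smoothed IDF and sparse vectors, but the intuition is the same: down-weight terms that occur everywhere.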
Article
The computer has contributed in multiple ways to improving the learning outcomes of students of the Prophetic Hadith: by widening the scope of research, retrieving hard-to-reach information quickly and efficiently, creating new modes of learning, linking the electronic world with the printed one, and producing research constructs that were previously unknown. At the same time, it has opened new horizons for deriving scientific research, which paved the way for teaching technology courses within Prophetic Hadith specializations, such as the bachelor-level course "Technology in the Service of the Sunnah" adopted by the College of the Noble Hadith at the Islamic University. This research highlights, God willing, some of the factors that have improved learning outcomes and opened the way for students toward technology-based Hadith projects, by measuring the impact of using computer technology on the outcomes of the educational process on one hand, and by shedding more light on the uses of these technologies in serving scientific research in the field of the purified Prophetic Sunnah on the other, most importantly methods of deriving research ideas through such computing. To achieve these aims, we carried out a set of experimental studies and statistical experiments that yielded notable results in developing new research ideas and improving learning outcomes.
Article
Sentiment analysis is a field of analytics research in which computational methods are used to make sense of raw text data. It assesses written expression as positive, negative, or neutral. People use a variety of social media platforms, including Facebook, Twitter, etc., and machine learning algorithms can be effectively used to ascertain people's sentiments. The field has developed to automate the study of such data: sentiment analysis aims to identify and extract human emotions from text, seeking opinionated information on the Web and categorising it based on its polarity, i.e. whether it carries a positive or negative meaning. In contrast to conventional text-based analysis, it helps to swiftly determine the customer's reaction.
Article
Online discussion forums are a valuable source of knowledge. Users may share or exchange ideas by posting content in the form of questions and answers. With the increasing volume of online content in the form of forums, finding relevant information in forums can be a challenging task, and knowledge management and quality assurance of this content are of critical importance. Although online discussion forums offer search services, in most cases only keyword search is provided. Keyword search techniques, such as cosine similarity, consider lexical overlap between query and document terms; however, these techniques do not consider the context or meaning of the terms, and thus fail to retrieve the relevant documents. Earlier content-based research efforts for improving the performance of thread retrieval were primarily based on the cosine similarity technique. The cosine similarity technique assigns term weights based on term frequency and inverse document frequency; however, it does not consider discussion semantics, which may lead to less effective document retrieval. To address these issues, we have proposed two thread ranking techniques for online discussion forums: (1) threads are ranked on the basis of a semantic similarity score between posts and (2) threads are ranked based on their participants' reputation and posts' quality. The proposed work provides a performance comparison between semantic similarity techniques and cosine similarity techniques, along with reputation and post quality features, in the thread ranking process. Experimental results obtained using a real online forum dataset demonstrate that the proposed techniques have significantly improved thread ranking performance.
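The keyword-search baseline that the work above compares against can be sketched as plain cosine similarity over term counts (the thread texts here are hypothetical):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts using raw term-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

threads = ["how to install python on windows",
           "python install error on windows 10",
           "best hiking trails near town"]
query = "install python windows"
ranked = sorted(threads, key=lambda t: cosine(query, t), reverse=True)
```

Note the limitation the abstract points out: the score rewards only lexical overlap, so a thread phrased "set up CPython on Win10" would score zero against this query, which motivates ranking by semantic similarity instead.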
Conference Paper
State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures---one based on bidirectional LSTMs and conditional random fields, and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora. Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers.
Article
Extensive work has been done on different activities of natural language processing for Western languages as compared to their Eastern counterparts, particularly South Asian languages. Western languages are termed resource-rich languages: core linguistic resources, e.g. corpora, WordNet, dictionaries, gazetteers and associated tools developed for Western languages, are customarily available. Most South Asian languages are low-resource languages. Urdu is a South Asian language, among the most widely spoken languages of the sub-continent, and due to resource scarcity not enough work has been conducted for it. The core objective of this paper is to present a survey of the different linguistic resources that exist for Urdu language processing, to highlight different tasks in Urdu language processing, and to discuss different state-of-the-art available techniques. Conclusively, this paper attempts to describe in detail the recent increase in interest and progress made in Urdu language processing research. Initially, the available datasets for the Urdu language are discussed. Characteristics of Urdu, resource sharing between Hindi and Urdu, and Urdu orthography and morphology are described. The aspects of pre-processing activities such as stop-word removal, diacritics removal, normalization and stemming are illustrated. A review of state-of-the-art research for tasks such as tokenization, sentence boundary detection, part-of-speech tagging, named entity recognition, parsing and WordNet development is presented. In addition, the impact of ULP on application areas such as information retrieval, classification and plagiarism detection is investigated. Finally, open issues and future directions for this new and dynamic area of research are provided. The goal of this paper is to organize the ULP work in a way that can provide a platform for future ULP research activities.
Article
Named Entity Recognition (NER) is an essential task for many natural language processing systems, which makes use of various linguistic resources. NER becomes more complicated when the language in use is morphologically rich and structurally complex, such as Arabic. This language has a set of characteristics that makes it particularly challenging to handle. In a previous work, we have proposed an Arabic NER system that follows the hybrid approach, i.e. integrates both rule-based and machine learning-based NER approaches. Our hybrid NER system is the state-of-the-art in Arabic NER according to its performance on standard evaluation datasets. In this article, we discuss a novel methodology for overcoming the coverage drawback of rule-based NER systems in order to improve their performance and allow for automated rule update. The presented mechanism utilizes the recognition decisions made by the hybrid NER system in order to identify the weaknesses of the rule-based component and derive new linguistic rules aiming at enhancing the rule base, which will help in achieving more reliable and accurate results. We used ACE 2004 Newswire standard dataset as a resource for extracting and analyzing new linguistic rules for person, location and organization names recognition. We formulate each new rule based on two distinctive feature groups, i.e. Gazetteers of each type of named entities and Part-of-Speech tags, in particular noun and proper noun. Fourteen new patterns are derived, formulated as grammar rules, and evaluated in terms of coverage. The conducted experiments exploit a POS tagged version of the ACE 2004 NW dataset. The empirical results show that the performance of the enhanced rule-based system, i.e. NERA 2.0, improves the coverage of the previously misclassified person, location and organization named entities types by 69.93 per cent, 57.09 per cent and 54.28 per cent, respectively.
Article
Hadiths are important textual sources of law, tradition, and teaching in the Islamic world. Analyzing the unique linguistic features of Hadiths (e.g. ancient Arabic language and story-like text) requires compiling and utilizing specific natural language processing methods. In the literature, no study is solely focused on Hadith from an artificial intelligence perspective, while many new developments have been overlooked and need to be highlighted. Therefore, this review analyzes all academic journal and conference publications that use the two main artificial intelligence methods for Hadith text: Hadith classification and Hadith mining. All relevant Hadith methods and algorithms from the literature are discussed and analyzed in terms of functionality, simplicity, F-score and accuracy. The use of various different Hadith datasets makes a direct comparison between the evaluation results impossible; therefore, we have re-implemented and evaluated the methods using a single dataset (i.e. 3150 Hadiths from the Sahih Al-Bukhari book). The evaluation of the classification methods reveals that neural networks classify Hadith with 94% accuracy, because neural networks are capable of handling complex (high-dimensional) input data. The Hadith mining method that combines the vector space model, cosine similarity, and enriched queries obtains the best accuracy result (88%) among the re-evaluated Hadith mining methods. The most important aspect of Hadith mining methods is query expansion, since the query must be fitted to the Hadith lingo. The lack of knowledge-based methods is evident in Hadith classification and mining approaches, an absence that future work can address using knowledge graphs.
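The "enriched query" idea mentioned above, i.e. expanding query terms before vector-space retrieval so the query fits the Hadith lingo, can be sketched as a simple synonym-table expansion (the table and terms here are hypothetical, not the review's actual resource):

```python
# Hypothetical expansion table mapping query terms to domain synonyms.
synonyms = {"prayer": ["salat", "salah"], "charity": ["zakat", "alms"]}

def expand(query):
    """Append known synonyms of each query term before retrieval."""
    terms = query.split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(synonyms.get(t, []))
    return expanded

print(expand("charity rules"))  # ['charity', 'rules', 'zakat', 'alms']
```

The expanded term list would then be turned into a vector and compared against Hadith vectors with cosine similarity, increasing recall for queries phrased in everyday vocabulary.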
Article
Most state-of-the-art named entity recognition (NER) systems rely on the use of handcrafted features and on the output of other NLP tasks such as part-of-speech (POS) tagging and text chunking. In this work we propose a language-independent NER system that uses automatically learned features only. Our approach is based on the CharWNN deep neural network, which uses word-level and character-level representations (embeddings) to perform sequential classification. We perform an extensive number of experiments using two annotated corpora in two different languages: the HAREM I corpus, which contains texts in Portuguese, and the SPA CoNLL-2002 corpus, which contains texts in Spanish. Our experimental results shed light on the contribution of neural character embeddings for NER. Moreover, we demonstrate that the same neural network which has been successfully applied to POS tagging can also achieve state-of-the-art results for language-independent NER, using the same hyper-parameters, and without any handcrafted features. For the HAREM I corpus, CharWNN outperforms the state-of-the-art system by 7.9 points in the F1-score for the total scenario (ten NE classes), and by 7.2 points in the F1 for the selective scenario (five NE classes).
Article
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, and time. In this paper we describe in detail a Hidden Markov Model (HMM) based machine learning approach to identify named entities. The main idea behind using an HMM for building an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed but dynamic, so users can define them according to their interests. The corpus used by our NER system is also not domain specific. Keywords: Named Entity Recognition (NER), Natural Language Processing (NLP), Hidden Markov Model (HMM).
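Decoding with an HMM of this kind is typically done with the Viterbi algorithm. The sketch below uses a toy two-state tag set ("PER" vs "O") and illustrative, hand-set probabilities rather than anything learned from a corpus, echoing the abstract's point that the states are user-defined rather than fixed:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observation sequence."""
    # Each cell holds (probability, best path ending in this state).
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-9), [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(o, 1e-9),
                 V[-1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

# Toy model: hand-set (not trained) probabilities for illustration.
states = ["PER", "O"]
start_p = {"PER": 0.3, "O": 0.7}
trans_p = {"PER": {"PER": 0.4, "O": 0.6}, "O": {"PER": 0.2, "O": 0.8}}
emit_p = {"PER": {"ali": 0.8}, "O": {"met": 0.5, "ali": 0.01}}

print(viterbi(["ali", "met", "ali"], states, start_p, trans_p, emit_p))
# → ['PER', 'O', 'PER']
```

Changing the `states` list (e.g. adding "LOC" or "ORG") requires no change to the decoder, which is the sense in which such a system is dynamic and language independent.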
Conference Paper
According to muhaddithun, hadith is 'what was transmitted on the authority of The Prophet (PBUH): his deeds, sayings, tacit approvals, or descriptions of his physical features and moral behaviors'. These days, an increasing number of Information Technology (IT) studies have been conducted on the hadith domain at different levels of hadith knowledge. A number of IT studies aim to validate hadith, and most of them are based on matching a test hadith against the authentic Ahadith in a database. However, there are limited computer-based studies that authenticate hadith based on scholars' principles. This paper discusses an analysis to produce a hierarchy with different levels of related studies in computational hadith, linking them to the computational authentication of the isnad al-hadith science. The result of the analysis is the deepest level of hadith authentication, at which we present the existing studies that conduct hadith authentication based on the principles of hadith authentication in hadith science. Future work on this analysis is a computational authentication of isnad al-hadith based on commonly agreed principles in hadith.
Article
In this survey, we will discuss current uses of finite-state information in several statistical natural language processing tasks. To this end, we will review standard approaches in tokenization, part-of-speech tagging, and parsing, and illustrate the utility of finite-state information and technology in these areas. The particular problems were chosen to allow a natural progression from simple prediction to structured prediction. We aim for a sufficiently formal presentation, suitable for readers with a background in automata theory, that allows readers to appreciate the contribution of finite-state approaches, but we will not discuss practical issues outside the core ideas. We provide instructive examples and pointers into the relevant literature for all constructions. We close with an outlook on finite-state technology in statistical machine translation.
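Tokenization, the simplest of the tasks the survey covers, illustrates the finite-state idea well: a tiny deterministic automaton with states for "inside a word", "inside a number", and "between tokens" suffices. The sketch below is a minimal illustration in that spirit, not a construction from the survey itself:

```python
# A small finite-state tokenizer: states are GAP (between tokens),
# WORD (alphabetic run), and NUM (digit run); a state change flushes the buffer.
def fsa_tokenize(text):
    tokens, buf, state = [], "", "GAP"
    for ch in text + " ":          # trailing sentinel flushes the last token
        if ch.isalpha():
            nxt = "WORD"
        elif ch.isdigit():
            nxt = "NUM"
        else:
            nxt = "GAP"
        if nxt != state and buf:   # transition out of a token state: emit token
            tokens.append(buf)
            buf = ""
        if nxt != "GAP":
            buf += ch
        state = nxt
    return tokens

print(fsa_tokenize("He paid 100 dollars."))  # ['He', 'paid', '100', 'dollars']
```

Production tokenizers add states for punctuation, abbreviations, and clitics, but remain finite-state in exactly this sense: the next action depends only on the current state and input character.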