Query-based information retrieval and knowledge
extraction using Hadith datasets
Ahsan Mahmood¹, Hikmat Ullah Khan², Zahoor-ur-Rehman¹, Wahab Khan³
¹Computer Science Dept., COMSATS Institute of Information Technology, Attock, Pakistan
²Computer Science Dept., COMSATS Institute of Information Technology, Wah, Pakistan
³Department of Computer Science and Software Engineering, International Islamic University, Islamabad 44000, Pakistan
Ahsan_mahmood_awan@yahoo.com, Hikmat.ullah@ciitwah.edu.pk, xahoor@ciit-attock.edu.pk, wahab.phdcs72@iiu.edu.pk
Abstract- In natural language processing, one of the fundamental tasks is Named Entity Recognition (NER), which involves identifying the names of people, locations, and other entities. Applications of NER include chatbots, speech recognition, machine translation, knowledge extraction, and intelligent search systems. NER has been an active research domain for the last 10 years. In this paper, we propose a knowledge extraction framework to extract named entities from the Urdu translation of Sahih Al-Bukhari, a world-renowned Hadith book. The proposed framework is based on a finite state transducer system to extract entities, and it processes the Hadith content using Part of Speech (POS) tagging. A Conditional Random Field, a probabilistic sequence labeling model, processes the extracted nouns for NER and classification. In the future, we aim to extend the proposed framework to rank the hadith content and apply the Vector Space Model.
Index Terms---Data Extraction, Hadith Information Retrieval,
Framework, Named Entity Recognition.
I- INTRODUCTION
In the current era of Information Technology, a major part of the data is in textual form, and this data is increasing at a very rapid rate because web users are shifting towards social media, where they share new textual data on a daily basis [1]. This data is available in multiple formats and at different places, including news, social media, blogs, Wikipedia, and articles, and it is also available in multiple languages. The huge increase in textual data on the World Wide Web mostly came after the birth of social media, where millions of people share textual content in different languages [2], [3]. In human-computer interaction, understanding natural language is one major task, and it is achieved with the help of NLP [4]. To completely understand and generate natural language, it is first necessary to perform basic NLP, i.e., syntactic analysis of the language, which includes understanding language morphology, part-of-speech tagging, parsing, stemming, etc., and then to perform the complex part of NLP in the form of Named Entity Recognition (NER), lexical semantics, sentiment analysis, machine translation, natural language understanding, natural language generation, etc. NER is one of the challenging tasks in NLP due to the unavailability or limitation of supervised training data [5]. Another thing that makes the NER task challenging is morphological richness [6]. NER involves the extraction of information such as the names of persons, locations, organizations, etc. NER is essential in NLP, especially when we want to extract the underlying knowledge from textual data, as most of the important entities are named entities (NEs). To achieve this task, NER relies on other NLP tasks such as stemming, POS tagging, and morphological analysis [7]. These tasks are applied over the textual data one by one, in the form of a framework, to extract the underlying information from the data that leads to knowledge extraction.
In the process of knowledge discovery from databases, Islamic knowledge sources such as the Quran and Hadith have high significance, as a huge amount of knowledge is present in these two sources. We need to understand the need for extracting Islamic knowledge, the role of Artificial Intelligence and Data Mining (DM), and how diverse methods can be applied to extract the relevant knowledge from Islamic data. What we want to achieve is access to the underlying knowledge in Islamic sources, which will enable the user to perform intelligent queries, searches, and relational queries over the knowledge system. As data all around the world is increasing at a very rapid rate, we must extract the knowledge of Islamic resources to distinguish between true and fabricated knowledge.
In this paper, we approach NLP through the Urdu translation of Sahih Bukhari, and our ultimate objective is to build an Information Retrieval (IR) system in which users can query and get the best results. Urdu is a morphologically rich language; it is similar to Arabic and has its roots in Arabic and Persian. Due to its lack of capitalization and its morphological complexities, NLP in Urdu is challenging [8]. We start with a model based on a Finite State Transducer (FST) to extract the entities of each hadith. From the extracted entities, we process the textual part (Sanad and Matn) of the hadith further, tokenizing the data and removing stop words. The tokenized words are then used for part of speech tagging with the help of Conditional Random Fields (CRF). In the tagged data, nouns are processed further, as most of the important information in the data is carried by nouns, so nouns are used to extract NEs from the tagged dataset. After the extraction of NEs, we detect multi-words and divide the extracted NEs into different classified groups.
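The tokenization and stop-word removal step described above can be sketched as follows; the whitespace tokenizer, the romanized sample sentence, and the small stop-word list are illustrative assumptions, not the actual Urdu resources used:

```python
# A minimal sketch of tokenization and stop-word removal.
# The stop-word list and the sample text are hypothetical
# placeholders, not the actual resources used in this work.

def tokenize(text, stop_words):
    """Split text on whitespace and drop stop words."""
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]

stop_words = {"aur", "ke", "ne", "se"}  # hypothetical romanized Urdu stop words
sample = "Abdullah ne kaha aur phir woh gaya"
print(tokenize(sample, stop_words))  # ['Abdullah', 'kaha', 'phir', 'woh', 'gaya']
```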
The paper is organized as follows: Section II presents the background work related to this paper, Section III contains the related work, and Section IV describes our proposed methodology, divided into different parts.
II- BACKGROUND
Let us share our existing work, in which we developed a Hadith Dataset Repository prepared by extracting data from different hadith books in different languages, including English, Urdu, and Arabic. The Dataset Repository is freely available for research purposes. In our earlier work, we selected different authenticated sources of hadith datasets from different hadith books, including Sahih Muslim, Sahih Al-Bukhari, Sunan Abu Dawood, Muwatta Imam Malik, etc. Data was extracted by crawling different websites and collecting the required data. In each book, the Hadiths are arranged in different formats, including chapters, sections, books, etc. Because the data was divided into hierarchies, we had to use a different algorithm for each book in order to extract all of its components. The algorithms were used along with regular expressions to parse each part of a hadith separately into the database.
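The regular-expression parsing described above can be sketched as follows; the field labels and record layout are hypothetical, since each source book used its own format:

```python
import re

# A minimal sketch of regex-based hadith parsing. The field labels
# ("Book:", "Chapter:", "Hadith:") and the flat record layout are
# hypothetical; each source book would need its own pattern.
record = "Book: 2 Chapter: 5 Hadith: 312 Text: ..."
pattern = re.compile(r"Book:\s*(\d+)\s*Chapter:\s*(\d+)\s*Hadith:\s*(\d+)")
m = pattern.search(record)
if m:
    book_no, chapter_no, hadith_no = m.groups()
    print(book_no, chapter_no, hadith_no)  # 2 5 312
```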
III- RELATED WORK
Multiple systems for text mining have been created using different techniques. NLP is one of the major fields of application of automata theory [9]. Through NLP we can retrieve information from textual data. In an information retrieval system, one of the important tasks is NER. NER is a task that consists of identifying lexical units from words; it involves the detection of concrete entities and the type or class to which each entity refers. In most NLP systems, this has been an important piece of information [10]. In the field of Named Entity Extraction (NEE), the Message Understanding Conference series has been one of the major forums where researchers compare their work on NER. The problem with NEE systems is that most of the work done in this field is specific to English or other LTR languages, while comparatively little has been done for Arabic and for South Asian languages such as Hindi and Urdu. Some work in the Arabic language has been presented, such as the ANERsys system, which is built for NER on Arabic textual data [11]. Meselhi et al. proposed a hybrid approach for NER in the Arabic language that includes a rule-based component [12]. Shaalan presented a survey paper discussing the different kinds of work done in Arabic NER and the techniques used for this purpose [13].
Over the last decade, researchers have been working on NER for South Asian languages using different techniques and rule-based systems. A system for NER in the Punjabi language has been developed using various types of rules, achieving accuracy levels from 82% to 89% [14]. Another system for NER was built based on the Hidden Markov Model; it is language independent and was developed for different South Asian languages [15]. Yet another system was built for NER in the Nepali language using a semi-hybrid, rule-based approach [16].
In recent years, South Asian languages such as Hindi and Urdu, along with Arabic, have been under focus by researchers. Urdu is a primary language of South Asia. The Urdu script is written right to left (RTL), like Arabic and Persian [17]. The International Joint Conference on Natural Language Processing 2008 hosted an important workshop in which NER systems were developed for Indian languages and Urdu; these systems used statistical or hybrid techniques to achieve the NER task [18]. In Urdu NLP, different types of techniques have been used to achieve NEE. One model was built on the basis of Conditional Random Fields (CRF) to achieve NER on Urdu language text [19]. Later, Mukund and Srihari proposed another, four-stage model based on CRF for the Urdu language that achieved an F-score of 68.9% [20]. Text processing applications require recognizing all entities such as names, numbers, organizations, dates, etc., and recognizing these entities means a vital goal has been achieved in this field [21]. According to Daud et al., NER is harder in the Urdu language due to its lack of capitalization features [8].
Hadith data has been used by many researchers for NLP and NER. Harrag et al. [22] performed named entity extraction from Hadith data; their work includes extracting the lexical units in words and other kinds of concrete entities. Another system has been presented for the extraction of narrators from a Hadith dataset through NER methods [23]. Further work has been presented on the extraction of surface information from hadith: the authors detected the entities and words that contain knowledge in Hadith texts, and their system was built on finite state transducers [24]. Ibrahim et al. [25] presented work on hadith data in which they performed a hierarchical analysis of the Isnad from hadith text and used the Isnad data for the analysis of Hadith. Saloot et al. [26] presented a paper comparing different kinds of NLP tasks applied over hadith data in different languages. Our previous work also includes work on Islamic sources such as the Holy Quran: we previously proposed a system through which a user can perform semantic search, based on an ontology [27], and another system for the same purpose based on a WordNet model [28].
IV- PROPOSED METHODOLOGY
The aim of this research is to propose an Information Retrieval system so users can query and get results according to their query. The most important steps of this research include extraction of entities, creation of tokens, removal of stop words, part of speech tagging of tokens, use of the tagged data for NER, multi-word detection from NER, and the creation of an IR system from the NER data. The proposed hierarchy of entities to be extracted is shown in Figure 1. We explain each part of the process in detail in the next sections.
Figure 1 Hierarchy of Entities for detection as per Proposed Model
A- Corpus of Sahih Al-bukhari
We obtained the complete Urdu translation of Sahih Bukhari from islamicurdubooks.com¹. The book is available in MS Word format and divided into 7 separate files. Each file contains about 1000 Hadiths, and the complete book contains 7563 Hadiths in total. The book has 3 hierarchical levels: books, chapters within books, and hadiths within chapters. Each hadith has multiple parts, which will be discussed in the next section.
B- Entity Extraction from Hadith
In order to process the book, it was necessary to extract each and every Hadith, with reference to its Book Name, Book Number, Chapter Name, Chapter Number, and Hadith Number, and then to extract the information from the Hadith Sanad and Matn. For this purpose, we used Finite State Transducers (FSTs).

¹ www.islamicurdubooks.com/download

The FST is one of the most commonly used tools in natural language processing. An FST is essentially a sequential machine with multiple possible outputs, defined by six components:
• Σ denotes the input alphabet
• Γ denotes the output alphabet
• Q denotes the set of states
• I denotes the set of initial states
• F denotes the set of final states
• Δ denotes the transition relation
An example of the proposed FST model is shown in Figure 2.
Figure 2 Proposed Finite State Transducer model
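As an illustration of the six components above, a toy transducer can be written down directly; the states, alphabets, and transitions below are hypothetical and far simpler than the model in Figure 2:

```python
# A toy finite state transducer built from the six components listed
# above. The alphabets and transitions are illustrative only, showing
# how a sequence of hadith header symbols could be mapped to entity
# labels; the actual model in Figure 2 is more elaborate.

class FST:
    def __init__(self, states, input_alpha, output_alpha, initial, finals, delta):
        self.states, self.I, self.F, self.delta = states, initial, finals, delta
        self.input_alpha, self.output_alpha = input_alpha, output_alpha

    def run(self, symbols):
        """Transduce an input sequence; return None unless we end in a final state."""
        state, out = self.I, []
        for s in symbols:
            state, o = self.delta[(state, s)]
            out.append(o)
        return out if state in self.F else None

# transition relation Δ: (state, input symbol) -> (next state, output symbol)
delta = {
    ("q0", "NUM"):  ("q1", "BookNo"),
    ("q1", "WORD"): ("q2", "BookName"),
    ("q2", "NUM"):  ("q3", "HadithNo"),
}
fst = FST({"q0", "q1", "q2", "q3"}, {"NUM", "WORD"},
          {"BookNo", "BookName", "HadithNo"}, "q0", {"q3"}, delta)
print(fst.run(["NUM", "WORD", "NUM"]))  # ['BookNo', 'BookName', 'HadithNo']
```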
Using FSTs, we developed an algorithm to extract all the entities from the Hadith dataset and saved each entity separately into the database. The model shown in Figure 1 has been used to extract the Book Number, Book Name, Chapter Number, Chapter Name, Hadith Number, Hadith Sanad, and Matn from the Hadith structure. An example showing all the components of a Hadith, both in the actual format present in the book and in the extracted pattern form, is given in Figure 3. One thing to note is that, as we have only processed the Hadith in the Urdu language, we skipped the Arabic parts, i.e., the Arabic Book Name, Chapter Name, and Hadith text, from the extraction model and extracted only the Urdu part of each Hadith.
Figure 3 Information Extraction Model of Hadith
C- Part of Speech (POS) Tagging
After we successfully extracted all the entities from each hadith, the next step was to tag the data on the basis of POS. For POS tag assignment we used a CRF [29] model. For CRF evaluation we made use of the Urdu Digest POS Tagged dataset released by the Center for Language Engineering (CLE dataset²) for research and computational processing in Urdu. The CLE dataset contains 100K Urdu words from various domains.
Our approach applied CRF for part of speech tag labeling and prediction. In particular, we used the C#-based open-source library CRFSharp³. CRFSharp is a new CRF package, and we believe that it is more flexible than the current advanced systems.
To test the performance of the CRF model, we conducted tenfold cross validation experiments on the CLE dataset; the CRF achieved very good results, as shown in Table 1.
Table 1 Experimental Result of POS Tagging
Type Result
Precision 96.44%
Recall 88.77%
F-Measure 92.41%
² http://www.cle.org.pk/ accessed on 01-07-2017
³ https://github.com/zhongkaifu/CRFSharp accessed on 01-07-2017
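For reference, the F-measure in Table 1 is the harmonic mean of precision and recall; a quick check (which may differ slightly from the reported 92.41% if the figure was averaged per fold) looks like:

```python
# F-measure as the harmonic mean of precision and recall. Averaging
# per-fold scores in cross validation can shift the combined figure
# slightly, which may explain small differences from Table 1.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(96.44, 88.77), 2))  # 92.45
```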
The features used in our 10-fold cross validation are as follows:
1. Previous lexical word
2. Current lexical word
3. Next lexical word
4. Current lexical word + Previous lexical word
5. Current lexical word + Next lexical word
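The five feature templates above can be sketched as a simple feature-extraction function; the padding symbols and sample tokens are illustrative assumptions:

```python
# A sketch of the five feature templates listed above, applied to one
# position in a token sequence. The "<s>"/"</s>" boundary padding and
# the sample tokens are hypothetical.
def features(tokens, i):
    prev_w = tokens[i - 1] if i > 0 else "<s>"
    next_w = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
    curr = tokens[i]
    return {
        "prev": prev_w,                      # 1. previous lexical word
        "curr": curr,                        # 2. current lexical word
        "next": next_w,                      # 3. next lexical word
        "curr+prev": curr + "|" + prev_w,    # 4. current + previous
        "curr+next": curr + "|" + next_w,    # 5. current + next
    }

toks = ["woh", "ghar", "gaya"]
print(features(toks, 1))
```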
After checking the CRF model's performance, we trained the CRF model again on the whole CLE dataset and supplied our data to it as test data. A graphical representation of our tagging work is shown in Figure 4.
Figure 4 Graphical Representation of POS Tagging
An example of our tagged Hadith data, using the hadith shown in Figure 3, is given in Figure 5.
Figure 5 Hadith Sanad And Matn Tagged Example
D- Named Entity Extraction and Multiword Detection
After POS tagging was applied to the data, the next step was to extract the nouns from the tagged Sanad and Matn documents, as shown in Figure 1. For this purpose, we extract the noun information from the Sanad and Matn data. This noun data will be processed further in the future to extract named entities and to perform multi-word detection. In named entity extraction, we will classify the extracted named entities into different classes, as given at the bottom of Figure 1.
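A minimal sketch of the noun-filtering step, assuming a CLE-style "NN" noun tag; the (token, tag) pairs are hypothetical:

```python
# A sketch of extracting noun tokens from POS-tagged Sanad/Matn data.
# The (token, tag) pairs and the "NN" noun label are illustrative;
# the actual CLE tagset labels may differ.
tagged = [("Abdullah", "NN"), ("ne", "PP"), ("kaha", "VB"), ("Madina", "NN")]
nouns = [tok for tok, tag in tagged if tag.startswith("NN")]
print(nouns)  # ['Abdullah', 'Madina']
```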
E- Proposed Information Retrieval System as Future Plan
For information retrieval against user queries, the Vector Space Model (VSM) will be used. The VSM is an algebraic model used to represent text documents as vectors. It is one of the major models used in information retrieval and is also called the term frequency-inverse document frequency (TF-IDF) model. Our proposed IR system is shown in Figure 6.
Figure 6 Information Retrieval System
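The TF-IDF weighting behind the VSM can be sketched as follows; the toy documents and query are placeholders, and the actual system would operate on tokenized hadith text:

```python
import math
from collections import Counter

# A minimal TF-IDF vector space sketch for ranking documents against
# a query. The documents and query terms are toy placeholders for the
# hadith collection the proposed IR system would index.
docs = [["prayer", "fasting", "charity"],
        ["prayer", "pilgrimage"],
        ["charity", "charity", "fasting"]]

def tfidf(doc, term, docs):
    """TF-IDF weight of one term in one document."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

def score(query, doc, docs):
    # sum of TF-IDF weights of query terms in the document
    return sum(tfidf(doc, t, docs) for t in query)

query = ["charity", "fasting"]
ranked = sorted(range(len(docs)), key=lambda i: score(query, docs[i], docs),
                reverse=True)
print(ranked[0])  # index of the best-matching document
```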
V- CONCLUSION
In this paper, our ultimate goal is to create an Information Retrieval system based on the Hadith data of Sahih Bukhari. For this purpose, we have used different algorithms and processes to achieve our target. Our work includes applying an FST to the Hadith data to extract entities, using a CRF-based tagger to tag the Sanad and Matn parts of each Hadith, and, on the tagged data, applying NER and classifying the named entities into different classes.
Our future work consists of extracting named entities from the tagged documents, performing multi-word detection on those extracted named entities, and applying the Vector Space Model (VSM) to the data to build an Information Retrieval system that will let users query against our model. We will also work further on our FST and POS tagging models to improve our processing results.
REFERENCES
[1] R. Khan, H. U. Khan, M. S. Faisal, K. Iqbal, and M.
S. I. Malik, “An Analysis of Twitter users of
Pakistan,” Int. J. Comput. Sci. Inf. Secur., vol. 14, no.
8, p. 855, 2016.
[2] E. Cambria and B. White, “Jumping NLP Curves: A
Review of Natural Language Processing Research
[Review Article],” IEEE Comput. Intell. Mag., vol. 9,
no. 2, pp. 48–57, May 2014.
[3] C. M. S. Faisal, A. Daud, F. Imran, and S. Rho, “A
novel framework for social web forums’ thread
ranking based on semantics and post quality features,”
J. Supercomput., vol. 72, no. 11, pp. 4276–4295, Nov.
2016.
[4] G. Luo, X. Huang, C.-Y. Lin, and Z. Nie, “Joint
named entity recognition and disambiguation,” in
Proc. EMNLP, 2015, pp. 879–880.
[5] G. Lample, M. Ballesteros, S. Subramanian, K.
Kawakami, and C. Dyer, “Neural Architectures for
Named Entity Recognition,” ArXiv160301360 Cs,
Mar. 2016.
[6] M. Oudah and K. Shaalan, “NERA 2.0: Improving
coverage and performance of rule-based named entity
recognition for Arabic,” Nat. Lang. Eng., vol. 23, no.
3, pp. 441–472, May 2017.
[7] C. N. dos Santos and V. Guimarães, “Boosting
Named Entity Recognition with Neural Character
Embeddings,” ArXiv150505008 Cs, May 2015.
[8] A. Daud, W. Khan, and D. Che, “Urdu language
processing: a survey,” Artif. Intell. Rev., vol. 47, no.
3, pp. 279–311, Mar. 2017.
[9] A. Maletti, “Survey: Finite-state technology in natural
language processing,” Theor. Comput. Sci., vol. 679,
pp. 2–17, May 2017.
[10] M. Padró and L. Padró, “A named entity recognition
system based on a finite automata acquisition
algorithm,” Proces. Leng. Nat., vol. 35, pp. 319–326,
2005.
[11] Y. Benajiba, P. Rosso, and J. M. BenedíRuiz,
“ANERsys: An Arabic Named Entity Recognition
System Based on Maximum Entropy,” in
Computational Linguistics and Intelligent Text
Processing, 2007, pp. 143–153.
[12] M. A. Meselhi, H. M. A. Bakr, I. Ziedan, and K.
Shaalan, “A Novel Hybrid Approach to Arabic
Named Entity Recognition,” in Machine Translation,
vol. 493, X. Shi and Y. Chen, Eds. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2014, pp.
93–103.
[13] K. Shaalan, “A Survey of Arabic Named Entity
Recognition and Classification,” Comput. Linguist.,
vol. 40, no. 2, pp. 469–510, Jun. 2014.
[14] G. Vishal and S. L. Gurpreet, “Named Entity
Recognition for Punjabi Language Text
Summarization,” Int. J. Comput. Appl., vol. 33, no. 3,
pp. 28–32, Nov. 2011.
[15] S. Morwal, N. Jahan, and D. Chopra, “Named Entity
Recognition using Hidden Markov Model (HMM),”
ResearchGate, vol. 1, no. 4, pp. 15–23, Dec. 2012.
[16] A. Dey, A. Paul, and B. Purkayastha, “Named Entity
Recognition for Nepali language: A Semi Hybrid
Approach,” Int. J. Eng. Innov. Technol., vol. 3, 2014.
[17] W. Anwar, X. Wang, and X. l Wang, “A Survey of
Automatic Urdu Language Processing,” in 2006
International Conference on Machine Learning and
Cybernetics, 2006, pp. 4489–4494.
[18] U. Singh, V. Goyal, and G. S. Lehal, “Named Entity
Recognition System for Urdu.,” in COLING, 2012,
pp. 2507–2518.
[19] A. Ekbal and S. Bandyopadhyay, “Named entity
recognition in Bengali using system combination,”
Lingvisticae Investig., vol. 37, no. 1, pp. 1–22, Jan.
2014.
[20] S. Mukund and R. K. Srihari, “NE Tagging for Urdu
Based on Bootstrap POS Learning,” in Proceedings of
the Third International Workshop on Cross Lingual
Information Access: Addressing the Information Need
of Multilingual Societies, Stroudsburg, PA, USA,
2009, pp. 61–69.
[21] K. Riaz, “Rule-based Named Entity Recognition in
Urdu,” in Proceedings of the 2010 Named Entities
Workshop, Stroudsburg, PA, USA, 2010, pp. 126–
135.
[22] F. Harrag, E. El-Qawasmeh, and A. M. S. Al-Salman,
“Extracting Named Entities from Prophetic Narration
Texts (Hadith),” in Software Engineering and
Computer Systems, 2011, pp. 289–297.
[23] M. A. Siddiqui, M. E. Saleh, and A. A. Bagais, “Extraction and Visualization of the Chain of Narrators from Hadiths using Named Entity Recognition and Classification,” ResearchGate, vol. 5, no. 1, Apr. 2014.
[24] F. Harrag, “Text mining approach for knowledge
extraction in Sahîh Al-Bukhari,” Comput. Hum.
Behav., vol. 30, pp. 558–566, Jan. 2014.
[25] N. K. Ibrahim, M. F. Noordin, S. Samsuri, M. S. A.
Seman, and A. E. B. Ali, “Isnad Al-Hadith
Computational Authentication: An Analysis
Hierarchically,” in 2016 6th International Conference
on Information and Communication Technology for
The Muslim World (ICT4M), 2016, pp. 344–348.
[26] M. A. Saloot, N. Idris, R. Mahmud, S. Ja’afar, D.
Thorleuchter, and A. Gani, “Hadith data mining and
classification: a comparative analysis,” Artif. Intell.
Rev., vol. 46, no. 1, pp. 113–128, Jun. 2016.
[27] H. U. Khan, S. M. Saqlain, M. Shoaib, and M. Sher,
“Ontology Based Semantic Search in Holy Quran,”
Int. J. Future Comput. Commun., vol. 2, no. 6, pp.
570–575, 2013.
[28] M. Shoaib, M. N. Yasin, U. K. Hikmat, M. I. Saeed,
and M. S. H. Khiyal, “Relational WordNet model for
semantic search in Holy Quran,” in 2009
International Conference on Emerging Technologies,
2009, pp. 29–34.
[29] J. Lafferty, A. McCallum, and F. Pereira,
“Conditional Random Fields: Probabilistic Models for
Segmenting and Labeling Sequence Data,” Dep. Pap.
CIS, Jun. 2001.