SOF: A Semi-supervised Ontology-learning-based Focused Crawler 1
Hai Dong^, Farookh Khadeer Hussain*, and Elizabeth Chang^
^School of Information Systems, Curtin Business School, Curtin University of Technology, Perth, WA 6845, Australia
*School of Software, Faculty of Engineering and Information Technology, University of Technology, Sydney, Ultimo, NSW 2007,
Australia
Abstract—The dynamic innovation of Internet technologies drives the fast growth of data volume on the Web, which makes it
increasingly impractical for a crawler to index the whole Web. Instead, many intelligent crawlers, known as ontology-based semantic
focused crawlers, are designed, by making use of Semantic Web technologies for topic-centered Web information crawling.
Ontologies, however, have the constraints of validity and time, which may influence the performance of the crawlers. Ontology
learning-based focused crawlers are therefore designed to automatically evolve ontologies by integrating ontology learning
technologies. Nevertheless, our survey indicates that the existing ontology-learning-based focused crawlers do not have the capability
to automatically enrich the content of ontologies, which makes these crawlers unreliable in the open and heterogeneous Web
environment. Hence, in this paper, we propose the framework of a novel semi-supervised ontology-learning-based focused crawler –
the SOF crawler, which embodies a series of schemas for ontology generation and Web information formatting, a semi-supervised
ontology learning framework, and a hybrid Web page classification approach aggregated by an SVM model. A series of tests is
implemented to evaluate the technical feasibility of the proposed framework. The conclusion and future work are summarized in
the final section.
Keywords—ontology-learning-based-focused crawler, ontological term learning, probabilistic model, semantic focused crawler,
semi-supervised ontology learning, semantic similarity model, support vector machine.
1. Introduction
With broadbandization of networks, popularization of multimedia broadcasting, and fusion of social networks and digital services, we
are living in the era of information explosion. According to a study2 conducted by IDC, in 2011, the amount of information created
and replicated surpassed 1.8 zettabytes (1.8 trillion gigabytes), nine times the amount created five years earlier. It is therefore not
difficult to understand how hard it is to collect required information on the Web. Popular search engines use bots, commonly
referred to as crawlers or spiders, to traverse the Web and index Web information. However, due to the dynamic growth of the
Web, it is becoming increasingly impractical for crawlers to index the whole Web 1. Instead, many intelligent crawlers, known as
topical/focused crawlers, are designed to find Web pages of a particular kind or on a particular topic, by avoiding hyperlinks that lead
to off-topic areas, and by concentrating on links to Web pages of interest 2, 3. Nevertheless, since the topical information used for
focused crawling is described by plain texts, there exists an ambiguity issue in the topical information, as a result of the nature of
natural languages. This issue may further result in ambiguous crawling boundaries and low precision of focused crawling. One
solution to this issue is to apply the background knowledge of crawling topics to focused crawling. An ontology, as a formal
representation of domain-specific knowledge, can be used in focused crawling to semantically define topical boundaries and enhance
the crawling precision. As ontologies are created by domain experts in terms of domain experts’ worldview, in order to represent the
current knowledge in a domain, two questions arise, which are, 1) does an ontology really reflect the knowledge in a domain in the
real world, and 2) how long can an ontology reflect the knowledge in a domain in the real world? With the consideration of the
validity and the time constraint of ontologies, several ontology-learning-based focused crawlers 4, 5 are proposed, in order to evolve
ontologies and maintain their high performance in the focused crawling process. However, our survey in this research found
that the existing methodologies in this area cannot guarantee their performance in an open and heterogeneous Web environment, where
numerous unpredictable new terms emerge in Web pages, since these methodologies do not have the functionality of
ontological term learning.
Motivated by the above limitation, in this paper we propose the framework of a novel ontology-learning-based focused crawler – the SOF crawler,
in order to realize ontological term learning, and high-performance ontology-based focused crawling and Web page
classification, in an open and heterogeneous Web environment. By means of this crawler, crawling topics are represented by
ontological concepts, and metadata are built on Web pages in order to semantically describe their content. In addition, this crawler
1 This is a preprint version of the paper: Dong, H., Hussain, F.K.: SOF: A semi-supervised ontology-learning-based focused crawler. Concurrency and Computation:
Practice and Experience 25(12) (August 2013) pp. 1755-1770. Download link: http://onlinelibrary.wiley.com/doi/10.1002/cpe.2980/abstract
2 http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
contains a semi-supervised ontology learning framework, enabling the continuous enrichment of the definitions of ontological
concepts and thus maintaining its performance in the focused crawling and Web page classification process in an open and
heterogeneous Web environment. In the semi-supervised ontology learning framework, a semantic similarity model and a probabilistic
model are designed to respectively measure the similarity between crawling topics and Web pages from different perspectives. A
Support Vector Machine (SVM) model is eventually trained to aggregate the results of the two models, in order to determine the
semantic relevance between crawling topics and Web pages.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the research areas of semantic focused crawling
and ontology-learning-based focused crawling, and review the previous work in the area of ontology learning-based focused crawling.
In Section 3 we introduce the system architecture and the functionalities of the components of the proposed SOF crawler, including
the general schemas for topical ontology building and Web page metadata generation, and the workflow of the semi-supervised
ontology learning and Web page classification process. In Section 4, we introduce the mathematical models employed in the semi-
supervised ontology learning and Web page classification process. In Section 5, we reveal the prototype implementation details of this
crawler, and evaluate its technical advantages by comparing its performance with the performance of the existing ontology-learning-
based focused crawlers. In Section 6, we summarize the technical features of this SOF crawler and draw our future research directions
in the area of ontology-learning-based focused crawling.
2. Related Work
In this section, we briefly introduce the research areas of semantic focused crawling and ontology-learning-based focused crawling,
and review the previous work in the area of ontology learning-based focused crawling.
A semantic focused crawler is a software agent that is able to traverse the Web, and retrieve as well as download related Web
information for specific topics, by means of semantic Web technologies 6, 7. The goal of semantic focused crawlers is to precisely and
efficiently retrieve and download relevant Web information by understanding the semantics underlying the Web information and the
semantics underlying the predefined topics. The semantic focused crawlers can briefly be classified into two clusters – the ontology-
based semantic focused crawlers and the non-ontology-based semantic focused crawlers, in terms of use of ontologies. The former
refers to the crawlers which make use of ontologies to represent the knowledge underlying topics and Web pages, and link the fetched
Web pages with semantically relevant ontological concepts, with the purpose of focused crawling and Web page classification 8-13.
The latter refers to the crawlers that make use of other Semantic Web technologies for focused crawling and Web page classification
14-17. According to a survey conducted by Dong et al. 18, it is found that most of the crawlers in this domain belong to the first cluster.
However, the limitation of the ontology-based semantic focused crawlers is that their crawling performance crucially depends on the
quality of ontologies. Furthermore, the quality of ontologies may be affected by two issues. The first issue is that, as it is well known
that an ontology is the formal representation of specific domain knowledge 19 and ontologies are designed by domain experts, there
may exist a gap between the domain experts’ understanding of the domain knowledge and the domain knowledge that exists in the real
world. The second issue is that knowledge in the real world is dynamic and persistently evolving, compared with
relatively static ontologies. These two contradictory situations could lead to the problem that ontologies sometimes cannot precisely
represent the real-world knowledge, considering the issues of differentiation and dynamism. The reflection of this problem in the field
of semantic focused crawling is that the ontologies used by semantic focused crawlers cannot precisely represent the knowledge
revealed in Web information, since Web information is mostly created or updated by human users with different knowledge
understandings, and human users are efficient learners of new knowledge. The eventual consequence of this problem could be
reflected in gradually descending performance curves of semantic focused crawlers.
In order to address this defect of ontologies and maintain or enhance the performance of semantic focused crawlers, researchers have
started to pay attention to enhancing semantic focused crawling technologies by integrating them with ontology learning technologies. The
goal of ontology learning is to semi-automatically extract facts or patterns from corpora or data and turn them into machine-
readable ontologies 20. Various techniques have been designed for ontology learning, such as statistics-based techniques, linguistics (or
natural language processing)-based techniques, logic-based techniques, etc. These techniques can also be classified into supervised,
semi-supervised, and unsupervised techniques from the perspective of learning control. Obviously, ontology-learning
techniques can be used to address this issue of semantic focused crawling, by learning new knowledge from crawled
documents and integrating the new knowledge with ontologies in order to persistently refine them.
In the rest of this section, we will review the existing works in the field of ontology learning-based semantic focused crawling. It is
found that few studies have been conducted in this field.
Zheng et al. 5 proposed a supervised ontology-learning-based focused crawler that aims to maintain the harvest rate of the crawler in
the crawling process. The main idea of this crawler is to construct an artificial neural network (ANN) model to determine the
relatedness between a Web page and an ontology. Given a domain-specific ontology and a topic represented by a concept in the
ontology, a set of relevant concepts are selected to represent the background knowledge about the topic, by counting the distance
between the topic concept and the other concepts in the ontology. The crawler then calculates the term frequencies of the relevant
concepts occurring in the visited Web pages. Next, the authors used the backpropagation algorithm to train a three-layer feedforward
ANN, the specification of which is shown in Table 1. The output of the ANN is the relevance score between the topic and a Web page.
The training process follows a supervised paradigm, by which the ANN is trained by labeled Web pages. The training will not stop
until the root mean square error (RMSE) is smaller than 0.01. The limitations of this approach are that, 1) it can only be used to
enhance the harvest rate of crawling but does not have the function of classification; 2) it cannot be used to evolve ontologies by
enriching the vocabulary of ontologies; and 3) the supervised learning may not work within an uncontrolled Web environment with
unpredictable new terms.
Input: frequency of relevant concepts – x_i
1st layer: linear function y_j = Σ_i W_ji x_i
Hidden layer: sigmoid transfer function z_j = 1 / (1 + e^(-y_j))
Output layer: sigmoid transfer function o = 1 / (1 + e^(-z_j))
Notation: x_i (i = 1…n) – a vector; n – number of relevant concepts in an ontology; W_ji (j = 1…4, i = 1…n) – a weight matrix
Table 1 Zheng et al.’s ANN model
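The forward pass of such a network can be sketched as follows. This is an illustrative reconstruction, not Zheng et al.’s implementation: the weight values are invented, and the output unit is assumed to sum the hidden activations before the final sigmoid.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def relevance_score(x, W):
    """Three-layer feedforward pass mirroring Table 1: a linear first
    layer, a sigmoid hidden layer, and a sigmoid output layer."""
    y = [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) for row in W]  # y_j = sum_i W_ji * x_i
    z = [sigmoid(y_j) for y_j in y]                                  # z_j = 1 / (1 + e^-y_j)
    return sigmoid(sum(z))                                           # relevance score in (0, 1)

x = [3.0, 0.0, 1.0, 2.0, 0.0]          # frequencies of 5 relevant concepts in a page
W = [[0.2, -0.1, 0.4, 0.0, 0.3],       # invented 4 x 5 weight matrix (j = 1..4, i = 1..5)
     [0.1, 0.2, -0.3, 0.5, 0.0],
     [-0.2, 0.1, 0.0, 0.3, 0.4],
     [0.3, 0.0, 0.2, -0.1, 0.1]]
score = relevance_score(x, W)
```

In the published approach the weights W would be fitted by backpropagation on labeled Web pages until the RMSE falls below 0.01; the weights here are static placeholders.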
Su et al. 4 proposed an unsupervised ontology-learning-based focused crawler in order to compute the relevance scores between topics
and Web pages. Given a specific domain ontology and a topic represented by a concept in this ontology, the relevance score between a
Web page and the topic is the weighted sum of the occurrence frequencies of all the concepts of the ontology in the Web page. The
original weight of each concept C_k is W^O_{C_k} = 1.00 / n^(d(C_k, t)), where n is a predefined discount factor, and d(C_k, t) is the distance between
the topic concept t and Ck. Next, this crawler makes use of reinforcement learning, which is a probabilistic framework for learning
optimal decision making from reward or punishment 21, in order to train the weight of each concept. The learning step follows an
unsupervised paradigm, which uses the crawler to download a number of Web pages and learn statistics based on these Web pages.
The learning step can be repeated many times. The weight of a concept Ck to a topic t in learning step m is mathematically expressed
as follows:
W^m_{C_k} = W^(m-1)_{C_k} · P(t | C_k) / P(t) = W^(m-1)_{C_k} · P(t, C_k) / (P(C_k) · P(t)) = W^(m-1)_{C_k} · (n^t_k · N_c) / (n_k · N_t) (1)
where n_k is the number of Web pages in which C_k occurs, n^t_k is the number of Web pages in which C_k and t co-occur, N_c is the total
number of Web pages crawled, and N_t is the number of Web pages in which t occurs. Compared with Zheng et al. 5’s approach, this
approach is able to classify Web pages by means of the concepts in an ontology, to learn the weights of relations between concepts,
and to work in an uncontrolled Web environment thanks to the unsupervised learning paradigm. The limitations of Su et al.’s approach
are that 1) it cannot be used to enrich the vocabulary of ontologies; 2) although the unsupervised learning paradigm can work in an
uncontrolled Web environment, it may not work well when numerous new terms emerge or when ontologies have limited vocabulary.
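Su et al.’s weighting scheme can be sketched directly from the two formulas above; the numeric values below are invented for illustration.

```python
def initial_weight(distance, discount):
    """W^O_{C_k} = 1.00 / discount^d(C_k, t): the weight decays with the
    ontology distance between concept C_k and the topic concept t."""
    return 1.00 / (discount ** distance)

def updated_weight(w_prev, n_k_t, n_k, N_t, N_c):
    """One learning step of equation (1):
    W^m = W^(m-1) * P(t|C_k) / P(t) = W^(m-1) * (n_k^t * N_c) / (n_k * N_t)."""
    if n_k == 0 or N_t == 0:
        return w_prev  # no statistics yet; keep the previous weight
    return w_prev * (n_k_t * N_c) / (n_k * N_t)

w0 = initial_weight(distance=2, discount=2.0)   # concept two hops from the topic
# concept co-occurs with the topic in 30 of its 40 pages; topic appears in 50 of 100 crawled pages
w1 = updated_weight(w0, n_k_t=30, n_k=40, N_t=50, N_c=100)
```

A concept that co-occurs with the topic more often than chance predicts (as here) has its weight boosted; one that rarely co-occurs is discounted over successive learning steps.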
By means of a comparative analysis of the two ontology-based focused crawlers (Table 2), we found a common limitation: neither of
the two crawlers is able to truly evolve ontologies by enriching their contents, namely their vocabularies. It is found that
both of the approaches attempt to use learning models to deduce the quantitative relationship between the occurrence frequencies of
the concepts in an ontology and the topic, which may not be applicable in the real Web environment. When numerous unpredictable
new terms outside the scope of the vocabulary of an ontology emerge in Web pages, these approaches cannot determine the
relatedness between the new terms and the topic, and cannot make use of the new terms for the relatedness determination, which could
result in the decline in their performance. Consequently, in order to address this research issue, we propose to design the SOF crawler,
in order to precisely discover, format and index relevant Web pages in the uncontrolled Web environment.
Feature: Zheng et al.’s crawler / Su et al.’s crawler / SOF crawler
Learning paradigm: Supervised / Unsupervised / Semi-supervised
Classification: No / Yes / Yes
Term learning: No / No / Yes
Relation learning: No / Yes / Yes
Open and heterogeneous environment: No / No / Yes
Table 2 Comparative analysis of the existing ontology-learning-based focused crawlers
3. System Architecture and Components
In this section, we introduce the system architecture and the functionalities of the components of the proposed SOF crawler.
The primary objective of this crawler is to maintain the precision of the ontology-based Web page focused crawling and classification,
by 1) enriching the vocabulary of ontologies, and 2) enabling the crawler itself to work in an uncontrolled Web environment. In order
to realize this objective, we propose a semi-supervised ontology learning approach, enabling the utilized ontology to evolve itself in an
uncontrolled environment, by learning unpredictable but semantically relevant terms extracted from Web pages.
We summarize the four major functions of the proposed crawler as follows: 1) downloading Web pages from the Internet; 2) generating
metadata from Web pages, in which metadata are the semantic descriptions of Web pages; 3) using ontologies to classify relevant
metadata in order to classify relevant Web pages and filter out non-relevant Web pages; and 4) enriching the vocabulary of ontologies
by means of the terms extracted from Web pages. A sketch map of the ontology-based Web page classification is shown in Fig. 1.
Fig. 1 Sketch map of the ontology-based Web page classification
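The four functions above can be sketched as a top-level loop. This is an illustrative sketch only: all callables (downloader, extract_terms, classify, learn) are hypothetical stand-ins for the SOF components, and hyperlink extraction into the frontier is omitted for brevity.

```python
def sof_crawl(seed_urls, ontology, downloader, extract_terms, classify, learn):
    """Top-level loop over the four functions of Section 3 (sketch)."""
    frontier = list(seed_urls)
    metadata_base = []
    while frontier:
        url = frontier.pop(0)
        page = downloader(url)                                   # 1) download a Web page
        metadata = {"URL": url,
                    "pageDescription": extract_terms(page)}      # 2) metadata generation
        concepts = classify(metadata, ontology)                  # 3) ontology-based classification
        if concepts:                                             # relevant: index and learn
            metadata["linkedConcept"] = concepts
            metadata_base.append(metadata)
            learn(ontology, metadata)                            # 4) enrich the ontology vocabulary
        # non-relevant pages are filtered out
    return metadata_base

# toy usage with trivial stand-ins
pages = {"http://example.com/a": "road freight transport"}
result = sof_crawl(
    ["http://example.com/a"], {"RoadFreight": []},
    downloader=pages.get,
    extract_terms=lambda p: p.split(),
    classify=lambda m, o: ["RoadFreight"] if "freight" in m["pageDescription"] else [],
    learn=lambda o, m: o["RoadFreight"].append(" ".join(m["pageDescription"])))
```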
It needs to be noted that this crawler is built upon the semantic focused crawling frameworks designed in our previous research work 6,
7. In our previous research work, we designed two pure semantic focused crawlers, which do not have an ontology-learning function to
automatically evolve the utilized ontologies. This research aims to remedy this defect.
The system architecture and system workflow of the proposed SOF crawler are shown in Fig. 2. Basically, the SOF crawler can be
divided into three components based on the functionalities, i.e., a storage component – the knowledge base, a processing component –
the crawling and processing module, and a computing component – the semi-supervised Web page classification and ontology
learning module. In the rest of this section, we will introduce the technical details regarding the three components.
[Fig. 2 depicts the workflow: the crawling and processing module (crawling, term extraction, preprocessing, term processing) feeds direct concept-metadata matching; unmatched metadata pass to SVM-based concept-metadata matching and then to metadata generation and association, ontology learning, or filtering; the knowledge base comprises the ontology base and the metadata base.]
Fig. 2 System architecture and system workflow of the SOF crawler
3.1 General Schemas of Ontology and Metadata in the Knowledge Base
The knowledge base consists of two components – an ontology base and a metadata base. The ontology base is designed with the
purpose of storing formal domain knowledge, i.e. ontologies, for ontology-based Web page filtering and classification. The metadata
base is used to store the semantically annotated information (i.e. metadata) with regard to Web pages. In order to realize the ontology-
based Web page filtering and classification as well as the semantic annotation, we define the general schemas respectively for
ontology and metadata. These two schemas can be customized according to the actual domain knowledge.
For the ontologies stored in the ontology base, it is reasonable to make use of a hierarchical ontology for Web page classification, in
which concepts are linked by the class/subclass relationship. Each concept represents the conceptualization of a specific topic, which
can be associated with semantically relevant Web pages. It needs to be noted that a Web page can be associated with more than one topic.
A subclass of a concept is a subtopic or a more specific topic of the topic represented by the concept. A superclass of a concept is the
upper topic or a more generalized topic of the topic represented by the concept. Therefore, taking into account the features of the
hierarchical ontologies, we can define the general schema of ontological concepts, instead of defining the general schema of a
hierarchical ontology. In addition to the class/subclass property, we define that each concept in a hierarchical ontology contains the
following elementary properties:
A conceptDescription property is a datatype property used to store the textual descriptions of a concept, which consists of one or
more phrases or sentences. Each phrase or sentence is a description or definition of a concept, which is defined by domain
experts. This property will be used in the process of Web page classification.
A learnedConceptDescription property is a datatype property that serves a similar purpose as the conceptDescription property.
The difference between the two properties is that the former is automatically learned from Web pages by the SOF crawler.
A linkedMetadata property is an object property used to associate a concept with a semantically relevant metadata record. This
property is used to semantically index the generated metadata by means of the concepts in an ontology.
For the metadata stored in the metadata base, a metadata record is the semantic description of a Web page, which contains the following
elementary properties:
A pageDescription property, which is a datatype property that stores the key terms and term frequencies used to describe the
topics of a Web page. The contents of this property are automatically extracted from the Web page by the SOF crawler, the
process of which will be introduced in Section 3.2. This property will be used for the forthcoming concept-metadata similarity
computation.
A URL property, which is a datatype property used to store the URL of the Web page to which this metadata corresponds.
A linkedConcept property, which is the inverse property of the linkedMetadata property. This property stores the URIs of the
semantically relevant concepts of the metadata. It needs to be noted that the metadata and the concepts can have a many-to-many
relationship.
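The two schemas can be illustrated as plain data structures. This is a minimal sketch: the class and attribute names are Python-style renderings of the properties above, not the actual OWL/RDF serialization used by the crawler.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """One concept of the hierarchical topical ontology (Section 3.1)."""
    uri: str
    concept_description: list = field(default_factory=list)          # expert-written phrases
    learned_concept_description: list = field(default_factory=list)  # phrases learned by SOF
    linked_metadata: list = field(default_factory=list)              # URIs of relevant metadata

@dataclass
class Metadata:
    """Semantic description of one crawled Web page."""
    uri: str
    url: str                                                  # the page this record describes
    page_description: dict = field(default_factory=dict)      # term -> frequency
    linked_concepts: list = field(default_factory=list)       # inverse of linkedMetadata

c = Concept(uri="ex:TransportService",
            concept_description=["freight transport of goods by road"])
m = Metadata(uri="ex:meta1", url="http://example.com/page1",
             page_description={"freight": 3, "road": 2})
# many-to-many linkage by mutual URI references
c.linked_metadata.append(m.uri)
m.linked_concepts.append(c.uri)
```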
3.2 System Workflow of the Modules
In this section, we introduce the functionalities of the crawling and processing module and the semi-supervised Web page classification and
ontology learning module, in terms of the workflow of the proposed SOF crawler.
The crawling and processing module is designed with the purpose of crawling Web pages and processing the contents of Web pages
and ontologies for forthcoming computation. As can be seen in Fig. 2, the first process in this module is preprocessing, which is to
process the contents of the conceptDescription property of each concept in the ontology for the forthcoming concept-metadata
matching, before the SOF crawler starts crawling over the Internet. This process is realized by using Java WordNet Library3 (JWNL)
to implement tokenization, part-of-speech (POS) tagging, nonsense word filtering, stemming, synonym searching, and term weighting
for the conceptDescription properties of the concepts. For the task of term weighting, each term in the conceptDescription property is
associated with a weight, in order to indicate the particularity of this term in the ontology. Here we make use of the inverse document
frequency (IDF) model (based on the assumption that the less frequently a term occurs, the more particular it is) for the weight
calculation. The term weight will be used for the forthcoming term processing process (Fig. 4). The algorithmic presentation of this
process is presented in Fig. 3.
Input: C_j are concepts of an ontology O; each concept C_j has a group of concept descriptions CD_jh, and each concept description
CD_jh has a group of terms CD_jhl.
Output: root, synonyms, and weight of CD_jhl – W_jhl.
Procedure:
for all concepts C_j do
  for all concept descriptions CD_jh of a concept C_j do
    for all terms CD_jhl in a concept description CD_jh do
      Remove punctuation in CD_jhl;
      Tokenize CD_jhl;
      Perform POS tagging for CD_jhl by WordNet;
      Remove words without POS tags in CD_jhl;
      Perform stemming for CD_jhl by WordNet;
      Find synonyms of CD_jhl from WordNet;
      CD_jhl ← CD_jhl ∪ synonyms of CD_jhl;
      W_jhl ← log ( |{C_j | ∀C_j ∈ O}| / |{C_j | ∀C_j ∈ O ∧ ∃CD_jhl ∈ C_j}| );
    end for
  end for
end for
Fig. 3 Procedure of the preprocessing process
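The IDF weighting step of the preprocessing procedure can be sketched as follows. This is an illustrative Python rendering that omits the WordNet-based POS tagging, stemming, and synonym expansion; the stopword list and the example ontology are invented for demonstration.

```python
import math
import string

STOPWORDS = {"a", "an", "the", "of", "for", "and", "or", "to", "in", "by"}

def preprocess(description):
    """Tokenize a concept description, dropping punctuation and stopwords
    (the paper additionally applies JWNL/WordNet processing here)."""
    text = description.lower().translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t not in STOPWORDS]

def idf_weights(concept_descriptions):
    """W_jhl = log(|concepts| / |concepts whose descriptions contain the term|)."""
    n = len(concept_descriptions)
    terms_per_concept = [set(t for d in descs for t in preprocess(d))
                         for descs in concept_descriptions.values()]
    vocab = set().union(*terms_per_concept)
    return {term: math.log(n / sum(term in ts for ts in terms_per_concept))
            for term in vocab}

ontology = {
    "RoadFreight": ["freight transport by road"],
    "RailFreight": ["freight transport by rail"],
    "Taxi": ["passenger transport by taxi"],
}
w = idf_weights(ontology)
```

A term occurring in every concept ("transport") gets weight 0, while a term unique to one concept ("taxi") gets the highest weight, matching the assumption that rarer terms are more particular.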
The second and third processes are crawling and term extraction. The missions of these two processes are to download Web pages from
the Internet one at a time, and to extract the required information from each downloaded Web page, according to the general metadata
schema defined in Section 3.1, in order to prepare the properties for generating a new metadata record. These two processes are realized by
the semantic focused crawlers designed in our previous works 6, 7, in which the extraction rules and the templates are defined by
observing common patterns in the HTML code. By means of these two processes, nearly all the properties of the metadata are
generated, except for the pageDescription property, which contains unprocessed key terms.
The fourth process is term processing, which is to process the contents of the pageDescription property of the metadata, in order to
prepare for the forthcoming concept-metadata matching. The implementation of this process is similar to the implementation of the
3 http://sourceforge.net/projects/jwordnet/
preprocessing process. The major differences are that 1) the term processing process does not need the function of synonym retrieval,
due to the provision of this function in the preprocessing process and the consideration of the computing cost; and 2) the term
processing process has a term frequency counting function, which is to count the frequency of the terms in the pageDescription
property. Similarly, the terms in the pageDescription property also need a weight indicating their particularity. Here a term
matching function is designed to pass on the weights of the ontological terms obtained in the preprocessing process, in order to reduce
the computing cost of this real-time process. By means of this term matching function, the terms in the pageDescription property are
matched with the terms occurring in the conceptDescription properties of the concepts in an ontology. If two terms match, the
associated weight of the matched term in the ontology is passed to the term in the pageDescription property; otherwise the term
in the pageDescription property is regarded as a new term and assigned the maximum valid weight for its particularity, i.e., log
(number of concepts in the ontology), in terms of the IDF algorithm. The weights of terms will be used in the following SVM-based
concept-metadata matching process (Section 4.1). The algorithmic expression of the term processing process is shown in Fig. 4.
Input: PD is the pageDescription property of a Web page P, and PD contains a group of terms PD_i. C_j are concepts of an ontology O;
each concept has a group of concept descriptions CD_jh. Each concept description CD_jh has a group of terms CD_jhl. Each term CD_jhl
is associated with a weight W_jhl.
Output: roots of PD_i, term frequency of PD_i – TF_i, term frequency of CD_jhl – TF_jhl, and weight of PD_i – W_i.
Procedure:
for all terms PD_i in P do
  Remove punctuation in PD_i;
  Perform POS tagging for PD_i by WordNet;
  Remove words without POS tags in PD_i;
  Perform stemming for PD_i by WordNet;
  TF_i ← frequency of PD_i;
  for all terms CD_jhl in O do
    if CD_jhl ≡ PD_i then
      W_i ← W_jhl;
      TF_jhl ← TF_i;
    end if
  end for
  if W_i = null then
    W_i ← log ( |{C_j | ∀C_j ∈ O}| );
  end if
end for
Fig. 4 Procedure of the term processing process
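The weight-passing logic of the term processing process can be sketched as follows; the function and variable names are illustrative, and the ontology term weights are toy values carried over from a hypothetical preprocessing run.

```python
import math

def process_page_terms(page_terms, ontology_term_weights, num_concepts):
    """Build the pageDescription contents: term frequencies plus weights.
    Known ontology terms inherit their IDF weight from preprocessing;
    unseen terms get the maximum valid weight log(|concepts|)."""
    tf = {}
    for t in page_terms:
        tf[t] = tf.get(t, 0) + 1                 # term frequency counting
    max_weight = math.log(num_concepts)          # weight for brand-new terms
    weights = {t: ontology_term_weights.get(t, max_weight) for t in tf}
    return tf, weights

ontology_term_weights = {"freight": 0.405, "road": 1.099}  # from preprocessing (toy values)
tf, w = process_page_terms(
    ["freight", "road", "freight", "drone"],               # stemmed terms from one page
    ontology_term_weights, num_concepts=3)
```

The unseen term "drone" is treated as maximally particular (weight log 3), so new vocabulary is never silently dropped; this is exactly what enables the later ontology learning step.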
The rest of the workflow can be integrated as a semi-supervised Web page classification and ontology learning module. The
detailed procedure of this module is described as follows: first of all, the direct string matching process examines whether or not the
content of the pageDescription property of a metadata is included in the conceptDescription and learnedConceptDescription
properties of a concept. If the answer is yes, then the concept and the metadata are regarded as semantically relevant. By means of the
metadata generation and association process, the metadata can then be generated and stored in the metadata base as well as associated
with the concept. If the answer is no, a Support Vector Machine (SVM)-based concept-metadata matching process will be invoked to
check the semantic relatedness between the metadata and the concept, by using a trained SVM model to assess the semantic
relatedness between the pageDescription property of the metadata and the phrases in the conceptDescription property of the concept,
the details of which will be introduced in Section 4. If the pageDescription property of the metadata is semantically relevant to any
phrases in the conceptDescription property of the concept, the metadata and the concept are regarded as semantically relevant, and the
contents of the pageDescription property of the metadata can be regarded as a new phrase for the learnedConceptDescription property
of the concept. The metadata is thus allowed to go through the metadata generation and association process; otherwise the metadata is
regarded as semantically non-relevant to the concept. The above process is repeated until all the concepts in the ontology are
compared to the metadata. If none of the concepts is semantically relevant to the metadata, this metadata is regarded as semantically
non-relevant to the domain represented by the ontology and will be filtered out. The algorithmic expression of the above processes is
revealed in Fig. 5.
Input: C_j are concepts of an ontology; each concept has a group of concept descriptions CD_jh and a group of learned concept
descriptions LCD_jh. Each concept description CD_jh has a group of terms CD_jhl, and each learned concept description LCD_jh
has a group of terms LCD_jhl. Each term CD_jhl is associated with a weight W_jhl. P is a Web page; P has a pageDescription
property PD, and PD contains a group of terms PD_i. Each term PD_i is associated with a weight W_i.
Output: (1) generate a metadata M if P is relevant to any concept C_j, (2) associate the semantically relevant concepts C_j and M, and (3)
update the learned concept descriptions LCD_jh if PD is not in CD_jh and LCD_jh.
Procedure:
for all concepts C_j do
  for all the concept descriptions CD_jh and the learned concept descriptions LCD_jh of a concept C_j do
    if (PD ≡ ∃CD_jh is true ∩ length of PD = length of CD_jh) ∪ (PD ≡ ∃LCD_jh) then
      if M does not exist then
        Generate a metadata M;
      end if
      Associate M and C_j by mutually referencing their URIs;
      break;
    else
      Sim_s(PD, CD_jh) ← the similarity value between PD and CD_jh by the Semantic-based String Matching algorithm;
      Sim_p(PD, CD_jh) ← the similarity value between PD and CD_jh by the Probability-based String Matching algorithm;
      if SVM(Sim_s(PD, CD_jh), Sim_p(PD, CD_jh)) = 1 then
        if M does not exist then
          Generate a metadata M;
        end if
        Associate M and C_j by mutually referencing their URIs;
        LCD_jh ← LCD_jh ∪ PD; // add the page description PD into LCD_jh
        break;
      end if
    end if
  end for
end for
Fig. 5 Procedure of the semi-supervised Web page classification and ontology learning module
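The module’s control flow can be sketched as follows. This is an illustrative reduction: sim_s and sim_p are naive stand-ins for the SSM and PSM algorithms of Section 4, and the trained SVM is replaced by a simple thresholding lambda.

```python
def sim_s(a, b):
    """Stand-in for the semantic-based string matching score (Section 4.1)."""
    ta, tb = set(a.split()), set(b.split())
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def sim_p(a, b):
    """Stand-in for the probability-based string matching score (Section 4.2)."""
    return sim_s(a, b)

def classify(metadata, concepts, svm_predict):
    """Sketch of the Fig. 5 loop: direct matching first, then SVM-based
    matching; relevant page descriptions become learned definitions."""
    relevant = []
    for c in concepts:
        known = c["conceptDescription"] + c["learnedConceptDescription"]
        if metadata["pageDescription"] in known:            # direct matching
            relevant.append(c["uri"])
            continue
        for d in c["conceptDescription"]:
            if svm_predict(sim_s(metadata["pageDescription"], d),
                           sim_p(metadata["pageDescription"], d)) == 1:
                relevant.append(c["uri"])
                # ontology learning: keep the page description as a new
                # learned definition of the concept
                c["learnedConceptDescription"].append(metadata["pageDescription"])
                break
    return relevant  # empty list -> metadata filtered out as off-topic

concepts = [{"uri": "ex:RoadFreight",
             "conceptDescription": ["freight transport by road"],
             "learnedConceptDescription": []}]
meta = {"pageDescription": "road freight transport services"}
found = classify(meta, concepts,
                 lambda s, p: 1 if (s + p) / 2 > 0.5 else 0)
```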
Although the SVM-based concept-metadata matching process is a supervised process, the inputs of the SVM model are
similarity values between a phrase in the pageDescription property of a metadata and a phrase in the conceptDescription property of a
concept (which will be introduced in Section 4), so the SVM determines the semantic relatedness between two phrases based on their
semantic similarity, regardless of their actual content. The subsequent ontology learning process is therefore able to learn
uncontrolled new definitions (phrases) that may contain unpredictable new terms, based on the similarity values controlled by the
SVM model, and is thus viewed as a semi-supervised ontology learning process.
4. Support Vector Machine (SVM)-based Concept-Metadata Matching Model
In this section, we introduce the mathematical models utilized in the SVM-based concept-metadata matching process (Fig. 2). In the
semi-supervised Web page classification and ontology learning module, if the descriptive terms extracted from a Web page, i.e., the
pageDescription property of a metadata, cannot directly match with any phrases in the conceptDescription property or
learnedConceptDescription property of a concept, the SVM-based concept-metadata matching process will be invoked to
mathematically examine their semantic relatedness. Fig. 6 indicates the workflow of the SVM-based concept-metadata matching
process. Each concept is associated with a particular SVM classifier, where the inputs of the classifier are the results of a semantic-
based string matching (SSM) algorithm and a probability-based string matching (PSM) algorithm between the concept and a metadata,
and the output of the SVM classifier is their binary semantic relatedness (relevant/non-relevant). Sections 4.1 to 4.3 introduce the SSM algorithm, the PSM algorithm, and the SVM classifier, respectively.
Fig. 6 Workflow of the SVM-based concept-metadata matching process
4.1 Semantic-based String Matching Algorithm
The key idea of the SSM algorithm is to measure the text similarity between a phrase in the conceptDescription property (abbreviated
as a concept description) of a concept and the pageDescription property (abbreviated as a page description) of a metadata, by means of
WordNet⁴ and a semantic similarity model.
A concept description and a page description can be regarded as two groups of terms with weights after the preprocessing and term
processing phase, in which terms in a concept description have their synonyms, and terms in a page description have their term
frequencies in the corresponding Web page. Therefore, we designed a weighted Dice’s coefficient algorithm to measure the semantic
similarity between a concept description and a page description, taking into account the requirements of high precision and short
response time for focused crawling. The mathematical expression of the weighted Dice’s coefficient algorithm is presented as follows:
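As a generic sketch (the notation here is illustrative rather than the exact weighting used by the SOF crawler), a weighted Dice's coefficient over a concept description CD and a page description PD can take the form:

```latex
\operatorname{Sim}_S(CD, PD) =
  \frac{2 \sum_{(t_c,\, t_p) \in M} \min\!\big(w_{CD}(t_c),\; w_{PD}(t_p)\big)}
       {\sum_{t_c \in CD} w_{CD}(t_c) \;+\; \sum_{t_p \in PD} w_{PD}(t_p)}
```

where M is the set of term pairs matched directly or through WordNet synonyms, and w_CD and w_PD denote the term weights (synonym-based weights and term frequencies, respectively).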
4.2 Probability-based String Matching Algorithm
The PSM algorithm is a complementary solution for measuring the relevance between concepts and metadata. It measures the co-occurrence frequencies of a page description of a metadata and a concept description of a concept in the crawled Web pages, based on a probabilistic model [22]. In the crawling process and the subsequent processes indicated in Fig. 2, the SOF crawler downloads k Web pages at the beginning, and automatically obtains statistical data from these k Web pages, in order to compute the relevance between the page description (PDi) of a metadata and a concept description (CDj,h) of a concept (Cj). The PSM algorithm follows an unsupervised training paradigm, which seeks the maximum probability that CDj,h and PDi co-occur in the trained Web pages. A graphical representation of the PSM algorithm is shown in Fig. 7. The PSM algorithm is mathematically expressed as follows:
maxSim_P(PD_i, CD_{j,h}) = max_{CD_{j,θ}∈C_j} [P(CD_{j,θ}|CD_{j,h}) P(CD_{j,θ}|PD_i)] = max_{CD_{j,θ}∈C_j} (n_{j,θ,h}/n_{j,h}) (n_{j,θ,i}/n_i)   (3)

where CD_{j,θ} is a concept description of C_j, n_{j,θ,h} is the number of Web pages that contain both CD_{j,θ} and CD_{j,h}, n_{j,h} is the number of Web pages that contain CD_{j,h}, n_{j,θ,i} is the number of Web pages that contain both CD_{j,θ} and PD_i, and n_i is the number of Web pages that contain PD_i.
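Given the co-occurrence counts defined above, the PSM maximization can be sketched as follows. The count arrays are hypothetical stand-ins for the statistics the crawler gathers from the k training pages, not its actual data structures:

```java
// Minimal sketch of the PSM maximization over the concept descriptions
// CD_{j,theta} of a concept, using hypothetical co-occurrence counts gathered
// from the k training pages.
public class PsmSketch {

    /**
     * @param coOccurWithCd n_{j,theta,h}: pages containing both CD_{j,theta} and CD_{j,h}
     * @param nCd           n_{j,h}: pages containing CD_{j,h}
     * @param coOccurWithPd n_{j,theta,i}: pages containing both CD_{j,theta} and PD_i
     * @param nPd           n_i: pages containing PD_i
     * @return max over theta of (n_{j,theta,h}/n_{j,h}) * (n_{j,theta,i}/n_i)
     */
    static double maxSimP(int[] coOccurWithCd, int nCd, int[] coOccurWithPd, int nPd) {
        double best = 0.0;
        for (int theta = 0; theta < coOccurWithCd.length; theta++) {
            double p = ((double) coOccurWithCd[theta] / nCd)
                     * ((double) coOccurWithPd[theta] / nPd);
            best = Math.max(best, p);
        }
        return best;
    }

    public static void main(String[] args) {
        // Two candidate concept descriptions CD_{j,theta}; the counts are made up.
        double sim = maxSimP(new int[]{8, 2}, 10, new int[]{6, 9}, 12);
        System.out.println(sim); // (8/10)*(6/12) = 0.4 beats (2/10)*(9/12) = 0.15
    }
}
```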
Fig. 7. Graphical representation of the PSM algorithm
4.3 SVM Classification Algorithm
The SVM classifier for each concept is designed to best aggregate the results of the SSM algorithm and the PSM algorithm in order to
decide on the semantic relatedness between a concept description and a page description, through a supervised training paradigm. This classifier provides a binary classification function (relevant/non-relevant), which is characterized by a hyperplane in a given feature space.
4 http://wordnet.princeton.edu/
Let X = [0, 1] × [0, 1] be the feature space with feature vectors x_i = (maxSim_S(PD_i, CD_{j,h}), maxSim_P(PD_i, CD_{j,h})), in which the features respectively represent the results of the SSM algorithm and the PSM algorithm. The y_i value of the training set equals −1 for a semantically non-relevant pair of concept description and page description, and 1 for a relevant pair. The y_i values in the training set are subjectively defined by domain experts. Eventually, the input of each SVM classifier is a set of training tuples {(x_1, y_1), …, (x_m, y_m)} with x_i ∈ X and y_i ∈ {−1, 1}.
The result of an SVM is a maximum-margin hyperplane, which separates the training examples in the feature space as precisely as possible, while the distance to the closest members on each side is maximized. This is expressed in the following optimization problem:

minimize_{w,b,ξ}: (1/2) wᵀw + C Σ_{i=1}^{N} ξ_i   (4)

subject to ∀i ∈ 1,…,N: y_i(wᵀφ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0,

where w and b describe the optimal hyperplane. The error term C Σ_{i=1}^{N} ξ_i is introduced to allow for outliers in a non-linearly separable training set, where ξ_i is a slack variable and the penalty parameter C controls the trade-off between the ξ_i and the size of the margin. φ is a predefined function which maps features into a higher-dimensional space, and a kernel function is required to reduce the computational load in this process. In this experiment, we employed the radial basis function (RBF) as the kernel, as the number of instances is far larger than the number of features. The RBF kernel function is defined as K(x_i, x_j) = e^{−γ‖x_i − x_j‖²}, γ > 0. We conducted the v-fold cross-validation and grid-search approach proposed in [23] in order to find the optimal C and γ. The theoretical details of SVM can be referenced from [24].
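Since each feature vector here is only two-dimensional, the RBF kernel is cheap to evaluate. The stand-alone helper below illustrates the computation; the prototype itself delegates kernel evaluation to libSVM, so this code is illustrative only:

```java
// Stand-alone RBF kernel for 2-D feature vectors x = (maxSimS, maxSimP).
// Illustrative only; the SOF prototype delegates this to libSVM.
public class RbfKernel {

    /** K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), gamma > 0. */
    static double k(double[] xi, double[] xj, double gamma) {
        double sq = 0.0;
        for (int d = 0; d < xi.length; d++) {
            double diff = xi[d] - xj[d];
            sq += diff * diff;   // accumulate squared Euclidean distance
        }
        return Math.exp(-gamma * sq);
    }

    public static void main(String[] args) {
        double[] a = {0.9, 0.4};   // an (SSM, PSM) similarity pair
        double[] b = {0.9, 0.4};
        System.out.println(k(a, b, 0.5)); // identical vectors give 1.0
    }
}
```

Larger γ makes the kernel decay faster with distance, which is precisely what the grid search over C and γ tunes.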
5. System Implementation and Evaluation
In this section, in order to systematically evaluate the framework of the proposed SOF crawler, we implement a prototype of this
crawler, and compare the performance of the crawler with the existing work reviewed in Section 2, based on several performance
indicators adopted from the information retrieval (IR) field.
5.1 Prototype Implementation and Test Environment Setup
The overall framework of the SOF crawler is built in Java within the platform of Eclipse 3.7.1⁵. The general ontology schema and general metadata schema are implemented in OWL-DL within the platform of Protégé 3.4.7⁶. The OWL API⁷ is utilized to access the OWL file and the libSVM⁸ Java library is utilized for the implementation of the SVM classifiers. For the purpose of comparatively analyzing our work against the existing work, i.e., Zheng et al.'s and Su et al.'s ontology-learning-based semantic focused crawlers, we implement a prototype for each crawler in Java, in which the ANN model used by Zheng et al.'s crawler is built in Encog⁹.
The test environment is initialized by two tasks: (1) the selection of a candidate ontology for ontology-based focused crawling and/or
classification, and ontology learning, and (2) the selection of Web pages for crawling, training, and testing. For the first task, we use a
previously designed mining service ontology, which represents the domain knowledge in the mining service industry. This mining
service ontology follows a four-level hierarchical structure, and consists of 158 concepts, in which each concept is defined by
following the general schema of ontological concepts introduced in Section 3.1. The mining service ontology is mostly referenced from Wikipedia¹⁰, the Australian Bureau of Statistics¹¹, and the websites of nearly 200 Australian and international mining service companies. The details of the mining service ontology can be referenced from [22]. For the second task, as mentioned in Section 2, one common
defect of the existing ontology-learning-based focused crawlers is that these crawlers cannot keep their performance in an
5 http://www.eclipse.org/
6 http://protege.stanford.edu/
7 http://owlapi.sourceforge.net/
8 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
9 http://code.google.com/p/encog-java/
10 http://en.wikipedia.org/
11 http://www.abs.gov.au/
uncontrolled Web environment with unpredictable new terms, due to the limitations of the adopted ontology learning approaches.
Hence, our proposed SOF crawler aims to remedy this defect, by following a semi-supervised Web page classification and ontology
learning paradigm. In order to evaluate our crawler and the existing crawlers in an uncontrolled Web environment, we choose two
mainstream mining service advertising websites – Australian Kompass¹² (abbreviated as Kompass below) and Australian Yellowpages®¹³ (abbreviated as Yellowpages® below) – as the testing data source. There are around 800 downloadable mining-related service or product advertisements registered in Kompass, and around 3200 similar advertisements registered in Yellowpages®, all of which are published in English. Since Zheng et al.'s crawler needs a supervised training process, Su et al.'s crawler needs an
unsupervised training process, and our proposed SOF crawler needs both supervised and unsupervised training processes, we label the
Web pages from Kompass, and use these Web pages as the training set for all of these crawlers. Subsequently, we test and compare
the performance of these crawlers on the task of crawling and classifying the Web pages from Yellowpages®, based on the
performance indicators introduced in the next section, with the purpose of evaluating their capability in this heterogeneous
environment.
5.2 Performance Indicators
We define the following parameters for comparing our crawler with the existing ontology-learning-based focused crawlers.
All the indicators are adopted from the field of IR and need to be redefined in order to be applied in the scenario of ontology-based
focused crawling.
Harvest rate is used to measure the harvesting ability of a crawler. Harvest rate for a crawler ε after crawling μ Web pages is defined as follows:

HR(ε, μ) = |A_μ| / |G_μ|   (5)

where |A_μ| is the number of associated metadata from the μ Web pages, and |G_μ| is the number of generated metadata from the μ Web pages.
Precision is used to measure the preciseness of a crawler. Precision for a concept Cj after crawling μ Web pages is defined as follows:

P_μ(C_j) = |{m_i | m_i ∈ A_{j,μ} ∧ m_i ∈ R_{j,μ}}| / |A_{j,μ}|   (6)

where A_{j,μ} is the set of associated metadata from the μ Web pages for C_j, |A_{j,μ}| is the number of associated metadata from the μ Web pages for C_j, and R_{j,μ} is the set of relevant metadata from the μ Web pages for C_j. It needs to be noted that the set of relevant metadata needs to be manually identified by peers before the evaluation.
Recall is used to measure the effectiveness of a crawler. Recall for a concept Cj after crawling μ Web pages is defined as follows:

R_μ(C_j) = |{m_i | m_i ∈ A_{j,μ} ∧ m_i ∈ R_{j,μ}}| / |R_{j,μ}|   (7)

where |R_{j,μ}| is the number of relevant metadata from the μ Web pages for C_j.
Harmonic mean is used to measure the aggregated performance of a crawler. Harmonic mean for a concept Cj after crawling μ Web pages is defined as follows:

HM_μ(C_j) = 2 P_μ(C_j) R_μ(C_j) / (P_μ(C_j) + R_μ(C_j))   (8)
12 http://au.kompass.com/
13 http://www.yellowpages.com.au/
Fallout is used to measure the inaccuracy of a crawler. Fallout for a concept Cj after crawling μ Web pages is defined as follows:

F_μ(C_j) = |{m_i | m_i ∈ A_{j,μ} ∧ m_i ∈ N_{j,μ}}| / |N_{j,μ}|   (9)

where N_{j,μ} is the set of non-relevant metadata from the μ Web pages for C_j, and |N_{j,μ}| is the number of non-relevant metadata from the μ Web pages for C_j. It needs to be noted that the set of non-relevant metadata needs to be manually identified by peers before the evaluation.
Crawling time is used to measure the efficiency of a crawler. Crawling time of the SOF crawler for a Web page is defined as the time
interval of processing the Web page from the crawling process to the metadata generation and association process or to the filtering
process, as shown in Fig. 2.
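The set-based indicators above reduce to simple set operations over the crawler's outputs and the manual relevance judgments. The sketch below illustrates them with hypothetical metadata identifiers; the judgments are assumed to be given:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative computation of the set-based indicators of Section 5.2.
// "associated"/"generated" count the metadata produced by the crawler;
// "relevant"/"nonRelevant" are the manual judgments for a concept Cj.
public class CrawlerMetrics {

    static double harvestRate(int associated, int generated) {
        return (double) associated / generated;                 // eq. (5)
    }

    static double precision(Set<String> associated, Set<String> relevant) {
        Set<String> hit = new HashSet<>(associated);
        hit.retainAll(relevant);                                // associated AND relevant
        return (double) hit.size() / associated.size();         // eq. (6)
    }

    static double recall(Set<String> associated, Set<String> relevant) {
        Set<String> hit = new HashSet<>(associated);
        hit.retainAll(relevant);
        return (double) hit.size() / relevant.size();           // eq. (7)
    }

    static double harmonicMean(double p, double r) {
        return 2 * p * r / (p + r);                             // eq. (8)
    }

    static double fallout(Set<String> associated, Set<String> nonRelevant) {
        Set<String> miss = new HashSet<>(associated);
        miss.retainAll(nonRelevant);                            // associated AND non-relevant
        return (double) miss.size() / nonRelevant.size();       // eq. (9)
    }

    public static void main(String[] args) {
        Set<String> assoc = Set.of("m1", "m2", "m3");
        Set<String> rel = Set.of("m1", "m2", "m4");
        Set<String> nonRel = Set.of("m3", "m5");
        double p = precision(assoc, rel);            // 2/3
        double r = recall(assoc, rel);               // 2/3
        System.out.println(harmonicMean(p, r));      // 2/3
        System.out.println(fallout(assoc, nonRel));  // 0.5
    }
}
```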
5.3 System Evaluation
In this section, we evaluate the feasibility of the SOF crawler, by comparing its performance with the existing ontology-learning-based
focused crawlers, i.e., Zheng et al.’s crawler and Su et al.’s crawler introduced in Section 2. We compare the performance of the three
crawlers based on the six parameters introduced in Section 5.2. Since Zheng et al.’s crawler does not have the function of
classification, we only obtain its performance data on harvest rate and crawling time.
Fig. 8. Comparison of the SOF crawler, Su et al.’s crawler, and Zheng et al.’s crawler on harvest rate
Fig. 9. Comparison of the SOF crawler and Su et al.’s crawler on precision
Fig. 10. Comparison of the SOF crawler and Su et al.’s crawler on recall
Fig. 11. Comparison of the SOF crawler and Su et al.’s crawler on Harmonic Mean
Fig. 12. Comparison of the SOF crawler and Su et al.’s crawler on fallout rate
The performance of the SOF crawler, Su et al.'s crawler, and Zheng et al.'s crawler on crawling time is shown in Fig. 13. It can be seen that initially there is no big difference among the three crawlers. Su et al.'s crawler takes the most crawling time. Since Zheng et al.'s crawler does not execute the task of classification, it needs less crawling time.
Fig. 13. Comparison of the SOF crawler, Su et al.’s crawler, and Zheng et al.’s crawler on crawling time
6. Conclusion and Future Works
In this paper, we presented the framework of an innovative semi-supervised ontology-learning-based focused crawler – the SOF
crawler, in order to maintain the performance of ontology-based semantic focused crawling in an open and heterogeneous Web
environment. Due to the limitations in the adopted ontology learning approaches, the existing ontology-learning-based focused
crawlers cannot work in an uncontrolled Web environment that contains unpredictable new terms. Hence, in the framework of the
SOF crawler, we proposed a semi-supervised ontology learning approach, which enables the utilized ontological concepts to
automatically learn new definitions from the semantically relevant Web information, while keeping its performance in focused
crawling and classification. A semantic-based string matching algorithm and a probability-based string matching algorithm were
designed to measure the semantic relatedness between ontological concepts and Web-information-generated metadata, respectively
from the perspectives of semantic similarity and statistical data. An SVM model was trained to eventually determine the binary relatedness (relevant/non-relevant) between a concept-metadata pair, by aggregating the results from the two algorithms. In
order to evaluate the research outcome, we built the prototypes of the SOF crawler and two existing ontology-learning-based focused
crawlers. Next, we tested and compared their performance based on several IR indicators, in a simulated heterogeneous Web
environment. The comparison results preliminarily prove the feasibility and technical advantages of the proposed SOF crawler.
For future work, we will focus on the following research tasks in the area of ontology-learning-based focused crawling: 1) we will
try to incorporate other ontology learning approaches into this framework in order to achieve better performance for ontology-based
focused crawling and classification; and 2) we will test the performance of this framework in other domains by developing new topical
ontologies or modifying the existing topical ontologies according to the defined general schema of ontological concepts.
References
[1] Batzios, A., Dimou, C., Symeonidis, A.L., and Mitkas, P.A.: 'BioCrawler: An intelligent crawler for the semantic web', Expert Systems with Applications, 2008, 35, (1-2), pp. 524-530
[2] Aggarwal, C.C., Al-Garawi, F., and Yu, P.S.: 'Intelligent crawling on the World Wide Web with arbitrary predicates'. Proc. 10th International Conference on World Wide Web (WWW '01), New York, NY, USA, 2001, pp. 96-105
[3] Chakrabarti, S., Berg, M.v.d., and Dom, B.: 'Focused crawling: a new approach to topic-specific Web resource discovery'. Proc. Eighth International Conference on World Wide Web (WWW '99), New York, NY, USA, 1999, pp. 1623-1640
[4] Su, C., Gao, Y., Yang, J., and Luo, B.: 'An efficient adaptive focused crawler based on ontology learning'. Proc. Fifth Int. Conf. on Hybrid Intelligent Syst. (HIS '05), Rio de Janeiro, Brazil, 6-9 Nov. 2005, pp. 73-78
[5] Zheng, H.-T., Kang, B.-Y., and Kim, H.-G.: 'An ontology-based approach to learnable focused crawling', Inform. Sciences, 2008, 178, (23), pp. 4512-4522
[6] Dong, H., and Hussain, F.K.: 'Focused crawling for automatic service discovery, annotation, and classification in industrial digital ecosystems', IEEE Trans. Ind. Electron., 2011, 58, (6), pp. 2106-2116
[7] Dong, H., Hussain, F.K., and Chang, E.: 'A framework for discovering and classifying ubiquitous services in digital health ecosystems', J. of Comput. and Syst. Sci., 2011, 77, (4), pp. 687-704
[8] Ehrig, M., and Maedche, A.: 'Ontology-focused crawling of Web documents'. Proc. Eighteenth Annual ACM Symposium on Applied Computing (SAC 2003), Melbourne, USA, 2003, pp. 9-12
[9] Ganesh, S., Jayaraj, M., Kalyan, V., and Aghila, G.: 'Ontology-based Web crawler'. Proc. 2004 International Conference on Information Technology: Coding and Computing (ITCC '04), Las Vegas, USA, 2004, pp. 337-341
[10] Halkidi, M., Nguyen, B., Varlamis, I., and Vazirgiannis, M.: 'THESUS: Organizing Web document collections based on link semantics', The VLDB Journal, 2003, 12, (4), pp. 320-332
[11] Huang, W., Zhang, L., Zhang, J., and Zhu, M.: 'Semantic focused crawling for retrieving e-commerce information', Journal of Software, 2009, 4, (5), pp. 436-443
[12] Yuvarani, M., Iyengar, N.C.S.N., and Kannan, A.: 'LSCrawler: a framework for an enhanced focused Web crawler based on link semantics'. Proc. 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI '06), Hong Kong, 2006, pp. 794-800
[13] Toch, E., Gal, A., Reinhartz-Berger, I., and Dori, D.: 'A semantic approach to approximate service retrieval', ACM Transactions on Internet Technology, 2007, 8, (1), pp. 2-31
[14] Can, A.B., and Baykal, N.: 'MedicoPort: A medical search engine for all', Computer Methods and Programs in Biomedicine, 2007, 86, (1), pp. 73-86
[15] Cesarano, C., d'Acierno, A., and Picariello, A.: 'An intelligent search agent system for semantic information retrieval on the internet'. Proc. Fifth International Workshop on Web Information and Data Management (WIDM '03), New Orleans, USA, 2003, pp. 111-117
[16] Batzios, A., Dimou, C., Symeonidis, A.L., and Mitkas, P.A.: 'BioCrawler: An intelligent crawler for the Semantic Web', Expert Systems with Applications, 2008, 35, (1-2), pp. 524-530
[17] Liu, H., Milios, E., and Janssen, J.: 'Probabilistic models for focused Web crawling'. Proc. 6th Annual ACM International Workshop on Web Information and Data Management (WIDM '04), Washington D.C., USA, 2004, pp. 16-22
[18] Dong, H., Hussain, F., and Chang, E.: 'State of the art in semantic focused crawlers', in Gervasi, O., Taniar, D., Murgante, B., Lagana, A., Mun, Y., and Gavrilova, M. (Eds.): 'Computational Sci. and Its Applicat. - ICCSA 2009' (Springer Berlin/Heidelberg, 2009), pp. 910-924
[19] Gruber, T.R.: 'A translation approach to portable ontology specifications', Knowledge Acquisition, 1993, 5, (2), pp. 199-220
[20] Wong, W., Liu, W., and Bennamoun, M.: 'Ontology learning from text: A look back and into the future', ACM Computing Surveys, 2011, to appear
[21] Rennie, J., and McCallum, A.: 'Using reinforcement learning to spider the Web efficiently'. Proc. Sixteenth Int. Conf. on Mach. Learning (ICML '99), Bled, Slovenia, 1999, pp. 335-343
[22] Dong, H., and Hussain, F.K.: 'Self-adaptive semantic focused crawler for mining services information discovery', IEEE Trans. Ind. Informat., 2012, submitted
[23] Hsu, C.-W., Chang, C.-C., and Lin, C.-J.: 'A practical guide to support vector classification', Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2007
[24] Boser, B.E., Guyon, I.M., and Vapnik, V.N.: 'A training algorithm for optimal margin classifiers'. Proc. Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, United States, 1992, pp. 144-152