SOF: A Semi-supervised Ontology-learning-based Focused Crawler 1
Hai Dong^, Farookh Khadeer Hussain*, and Elizabeth Chang^
^School of Information Systems, Curtin Business School, Curtin University of Technology, Perth, WA 6845, Australia
*School of Software, Faculty of Engineering and Information Technology, University of Technology, Sydney, Ultimo, NSW 2007,
Australia
Abstract—The dynamic innovation of Internet technologies drives the fast growth of data volume on the Web, which makes it increasingly impractical for a crawler to index the whole Web. Instead, many intelligent crawlers, known as ontology-based semantic focused crawlers, have been designed by making use of Semantic Web technologies for topic-centered Web information crawling. Ontologies, however, have the constraints of validity and time, which may influence the performance of the crawlers. Ontology-learning-based focused crawlers are therefore designed to automatically evolve ontologies by integrating ontology learning technologies. Nevertheless, our survey indicates that the existing ontology-learning-based focused crawlers do not have the capability to automatically enrich the content of ontologies, which makes these crawlers unreliable in the open and heterogeneous Web environment. Hence, in this paper, we propose the framework of a novel semi-supervised ontology-learning-based focused crawler, the SOF crawler, which embodies a series of schemas for ontology generation and Web information formatting, a semi-supervised ontology learning framework, and a hybrid Web page classification approach aggregated by an SVM model. A series of tests is implemented to evaluate the technical feasibility of the proposed framework. The conclusion and the future work are summarized in the final section.
Keywords—ontology-learning-based focused crawler, ontological term learning, probabilistic model, semantic focused crawler,
semi-supervised ontology learning, semantic similarity model, support vector machine.
1. Introduction
With broadbandization of networks, popularization of multimedia broadcasting, and fusion of social networks and digital services, we
are living in the era of information explosion. According to a study2 conducted by IDC, in 2011, the amount of information created
and replicated surpassed 1.8 zettabytes (1.8 trillion gigabytes), nine times that of five years earlier. It is therefore not difficult to understand how hard it is to collect required information on the Web. Popular search engines use bots, commonly referred to as crawlers or spiders, to traverse the Web and index Web information. However, due to the dynamic growth of the
Web, it is becoming increasingly impractical for crawlers to index the whole Web1. Instead, many intelligent crawlers, known as
topical/focused crawlers, are designed to find Web pages of a particular kind or on a particular topic, by avoiding hyperlinks that lead
to off-topic areas, and by concentrating on links to Web pages of interest 2, 3. Nevertheless, since the topical information used for
focused crawling is described by plain text, an ambiguity issue arises in the topical information as a result of the nature of natural languages. This issue may further lead to ambiguous crawling boundaries and low precision in focused crawling. One solution to this issue is to apply the background knowledge of crawling topics to focused crawling. Ontology, as a form of formal
representation of domain-specific knowledge, can be used in focused crawling to semantically define topical boundaries and enhance
the crawling precision. As ontologies are created by domain experts according to their own worldview, in order to represent the
current knowledge in a domain, two questions arise, which are, 1) does an ontology really reflect the knowledge in a domain in the
real world, and 2) how long can an ontology reflect the knowledge in a domain in the real world? With the consideration of the
validity and the time constraint of ontologies, several ontology-learning-based focused crawlers 4, 5 are proposed, in order to evolve
ontologies and keep their high performance in the focused crawling process. However, based on a survey in this research, it is found
that the existing methodologies in this area cannot guarantee their performance in an open and heterogeneous Web environment where
numerous unpredictable new terms emerge in Web pages, since the existing methodologies do not have the functionality of
ontological term learning.
Based on the above factors, in this paper we propose the framework of a novel ontology-learning-based focused crawler, the SOF crawler, in order to realize ontological term learning, as well as high-performance ontology-based focused crawling and Web page classification, in an open and heterogeneous Web environment. By means of this crawler, crawling topics are represented by
ontological concepts, and metadata are built on Web pages in order to semantically describe their content. In addition, this crawler

1 This is a preprint version of the paper: Dong, H., Hussain, F.K.: SOF: A semi-supervised ontology-learning-based focused crawler. Concurrency and Computation:
Practice and Experience 25(12) (August 2013) pp. 1755-1770. Download link: http://onlinelibrary.wiley.com/doi/10.1002/cpe.2980/abstract
2 http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
contains a semi-supervised ontology learning framework, enabling the continuous enrichment of the definitions of ontological
concepts and thus maintaining its performance in the focused crawling and Web page classification process in an open and
heterogeneous Web environment. In the semi-supervised ontology learning framework, a semantic similarity model and a probabilistic
model are designed to respectively measure the similarity between crawling topics and Web pages from different perspectives. A
Support Vector Machine (SVM) model is eventually trained to aggregate the results of the two models, in order to determine the
semantic relevance between crawling topics and Web pages.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the research areas of semantic focused crawling
and ontology-learning-based focused crawling, and review the previous work in the area of ontology learning-based focused crawling.
In Section 3 we introduce the system architecture and the functionalities of the components of the proposed SOF crawler, including
the general schemas for topical ontology building and Web page metadata generation, and the workflow of the semi-supervised
ontology learning and Web page classification process. In Section 4, we introduce the mathematical models employed in the semi-
supervised ontology learning and Web page classification process. In Section 5, we reveal the prototype implementation details of this
crawler, and evaluate its technical advantages by comparing its performance with the performance of the existing ontology-learning-
based focused crawlers. In Section 6, we summarize the technical features of this SOF crawler and draw our future research directions
in the area of ontology-learning-based focused crawling.
2. Related Works
In this section, we briefly introduce the research areas of semantic focused crawling and ontology-learning-based focused crawling,
and review the previous work in the area of ontology learning-based focused crawling.
A semantic focused crawler is a software agent that is able to traverse the Web, and retrieve as well as download related Web
information for specific topics, by means of semantic Web technologies 6, 7. The goal of semantic focused crawlers is to precisely and
efficiently retrieve and download relevant Web information by understanding the semantics underlying the Web information and the
semantics underlying the predefined topics. The semantic focused crawlers can briefly be classified into two clusters – the ontology-
based semantic focused crawlers and the non-ontology-based semantic focused crawlers, in terms of use of ontologies. The former
refers to the crawlers which make use of ontologies to represent the knowledge underlying topics and Web pages, and link the fetched
Web pages with semantically relevant ontological concepts, with the purpose of focused crawling and Web page classification 8-13.
The latter refers to the crawlers that make use of other Semantic Web technologies for focused crawling and Web page classification
14-17. According to a survey conducted by Dong et al. 18, it is found that most of the crawlers in this domain belong to the first cluster.
However, the limitation of the ontology-based semantic focused crawlers is that their crawling performance crucially depends on the
quality of ontologies. Furthermore, the quality of ontologies may be affected by two issues. The first issue is that, as it is well known
that an ontology is the formal representation of specific domain knowledge 19 and ontologies are designed by domain experts, there
may exist a gap between the domain experts' understanding of the domain knowledge and the domain knowledge that exists in the real world. The second issue is that knowledge in the real world is dynamic and is in a persistently evolving process, compared with
relatively static ontologies. These two contradictory situations could lead to the problem that ontologies sometimes cannot precisely
represent the real-world knowledge, considering the issues of differentiation and dynamism. The reflection of this problem in the field
of semantic focused crawling is that the ontologies used by semantic focused crawlers cannot precisely represent the knowledge
revealed in Web information, since Web information is mostly created or updated by human users with different knowledge
understandings, and human users are efficient learners of new knowledge. The eventual consequence of this problem could be
reflected in the gradually descending curves in the performance of semantic focused crawlers.
In order to address this defect of ontologies and maintain or enhance the performance of semantic focused crawlers, researchers have started to pay attention to enhancing semantic focused crawling technologies by integrating them with ontology learning technologies. The goal of ontology learning is to semi-automatically extract facts or patterns from corpora or data and turn these facts and patterns into machine-readable ontologies 20. Various techniques have been designed for ontology learning, such as statistics-based techniques, linguistics (or natural language processing)-based techniques, and logic-based techniques. These techniques can also be classified into supervised techniques, semi-supervised techniques, and unsupervised techniques from the perspective of learning control. Obviously, ontology learning techniques can be used to address this issue of semantic focused crawling, by learning new knowledge from crawled documents and integrating the new knowledge with ontologies in order to persistently refine the ontologies.
In the rest of this section, we will review the existing works in the field of ontology learning-based semantic focused crawling. It is
found that few studies have been conducted in this field.
Zheng et al. 5 proposed a supervised ontology-learning-based focused crawler that aims to maintain the harvest rate of the crawler in
the crawling process. The main idea of this crawler is to construct an artificial neural network (ANN) model to determine the
relatedness between a Web page and an ontology. Given a domain-specific ontology and a topic represented by a concept in the
ontology, a set of relevant concepts are selected to represent the background knowledge about the topic, by counting the distance
between the topic concept and the other concepts in the ontology. The crawler then calculates the term frequencies of the relevant
concepts occurring in the visited Web pages. Next, the authors used the backpropagation algorithm to train a three-layer feedforward
ANN, the specification of which is shown in Table 1. The output of the ANN is the relevance score between the topic and a Web page.
The training process follows a supervised paradigm, by which the ANN is trained by labeled Web pages. The training will not stop
until the root mean square error (RMSE) is smaller than 0.01. The limitations of this approach are that, 1) it can only be used to
enhance the harvest rate of crawling but does not have the function of classification; 2) it cannot be used to evolve ontologies by
enriching the vocabulary of ontologies; and 3) the supervised learning may not work within an uncontrolled Web environment with
unpredictable new terms.
Input: frequency of relevant concepts, x_i
1st layer: linear function y_j = W_{ji} x_i
Hidden layer: sigmoid transfer function z_j = 1 / (1 + e^{-y_j})
Output layer: sigmoid transfer function o = 1 / (1 + e^{-z_j})
Notation: x_i (i = 1…n) is an input vector, n is the number of relevant concepts in an ontology, and W_{ji} (j = 1…4, i = 1…n) is a weight matrix.
Table 1 Zheng et al.'s ANN model
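The forward pass summarized in Table 1 can be sketched in a few lines of Java. The sketch below is only illustrative: the layer sizes, the presence of output weights, and the helper names are our own assumptions, and the actual weights would be learned by backpropagation as Zheng et al. describe.

```java
// Minimal sketch of a three-layer feedforward pass as summarized in Table 1.
// Weights and layer sizes are illustrative; Zheng et al. train them with backpropagation.
public class FeedforwardSketch {

    static double sigmoid(double v) {
        return 1.0 / (1.0 + Math.exp(-v));
    }

    // x: frequencies of the relevant concepts; wHidden: weight matrix of the first layer;
    // wOut: weights feeding the output neuron.
    static double relevanceScore(double[] x, double[][] wHidden, double[] wOut) {
        double[] z = new double[wHidden.length];
        for (int j = 0; j < wHidden.length; j++) {
            double yj = 0.0;                       // linear combination y_j = W_ji * x_i
            for (int i = 0; i < x.length; i++) {
                yj += wHidden[j][i] * x[i];
            }
            z[j] = sigmoid(yj);                    // hidden layer: z_j = 1 / (1 + e^{-y_j})
        }
        double o = 0.0;
        for (int j = 0; j < z.length; j++) {
            o += wOut[j] * z[j];
        }
        return sigmoid(o);                         // output layer: relevance score in (0, 1)
    }
}
```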
Su et al. 4 proposed an unsupervised ontology-learning-based focused crawler in order to compute the relevance scores between topics
and Web pages. Given a specific domain ontology and a topic represented by a concept in this ontology, the relevance score between a
Web page and the topic is the weighted sum of the occurrence frequencies of all the concepts of the ontology in the Web page. The
original weight of each concept C_k is $W_{C_k}^{O} = 1.00 \times n^{d(C_k, t)}$, where n is a predefined discount factor, and d(C_k, t) is the distance between
the topic concept t and Ck. Next, this crawler makes use of reinforcement learning, which is a probabilistic framework for learning
optimal decision making from reward or punishment 21, in order to train the weight of each concept. The learning step follows an
unsupervised paradigm, which uses the crawler to download a number of Web pages and learn statistics based on these Web pages.
The learning step can be repeated many times. The weight of a concept Ck to a topic t in learning step m is mathematically expressed
as follows:
$$W_{C_k}^{m} = W_{C_k}^{m-1}\cdot\frac{P(t \mid C_k)}{P(t)} = W_{C_k}^{m-1}\cdot\frac{P(t \cap C_k)}{P(C_k)\,P(t)} = W_{C_k}^{m-1}\cdot\frac{n_k^{t}/n_k}{N_t/N_c} \qquad (1)$$

where $n_k$ is the number of Web pages in which C_k occurs, $n_k^{t}$ is the number of Web pages in which C_k and t co-occur, $N_c$ is the total number of Web pages crawled, and $N_t$ is the number of Web pages in which t occurs. Compared with Zheng et al. 5's approach, this
approach is able to classify Web pages by means of the concepts in an ontology, to learn the weights of relations between concepts,
and to work in an uncontrolled Web environment thanks to the unsupervised learning paradigm. The limitations of Su et al.’s approach
are that 1) it cannot be used to enrich the vocabulary of ontologies; 2) although the unsupervised learning paradigm can work in an
uncontrolled Web environment, it may not work well when numerous new terms emerge or when ontologies have limited vocabulary.
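As an illustration of the weight update in Equation (1), the following sketch recomputes a concept weight from simple page counts; the method and parameter names are ours, and the counts are assumed to have been gathered during one learning step.

```java
// Sketch of Su et al.'s weight update (Equation (1)) from co-occurrence counts.
public final class ConceptWeightUpdate {

    /**
     * @param previousWeight            W^{m-1} of concept C_k
     * @param pagesWithConceptAndTopic  number of pages containing both C_k and the topic t (n_k^t)
     * @param pagesWithConcept          number of pages containing C_k (n_k)
     * @param pagesWithTopic            number of pages containing t (N_t)
     * @param totalPages                total number of crawled pages (N_c)
     */
    static double updatedWeight(double previousWeight,
                                int pagesWithConceptAndTopic,
                                int pagesWithConcept,
                                int pagesWithTopic,
                                int totalPages) {
        if (pagesWithConcept == 0 || pagesWithTopic == 0 || totalPages == 0) {
            return previousWeight;                       // no evidence in this learning step
        }
        double pTopicGivenConcept = (double) pagesWithConceptAndTopic / pagesWithConcept; // P(t | C_k)
        double pTopic = (double) pagesWithTopic / totalPages;                             // P(t)
        return previousWeight * (pTopicGivenConcept / pTopic);   // W^m = W^{m-1} * P(t|C_k) / P(t)
    }
}
```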
By means of a comparative analysis of the two ontology-based focused crawlers (Table 2), we found a common limitation, which is
that neither of the two crawlers is able to truly evolve ontologies by enriching their content, namely their vocabularies. It is found that
both of the approaches attempt to use learning models to deduce the quantitative relationship between the occurrence frequencies of
the concepts in an ontology and the topic, which may not be applicable in the real Web environment. When numerous unpredictable
new terms outside the scope of the vocabulary of an ontology emerge in Web pages, these approaches cannot determine the
relatedness between the new terms and the topic, and cannot make use of the new terms for the relatedness determination, which could
result in the decline in their performance. Consequently, in order to address this research issue, we propose to design the SOF crawler,
in order to precisely discover, format and index relevant Web pages in the uncontrolled Web environment.
                                    Zheng et al.'s crawler   Su et al.'s crawler   SOF crawler
Learning paradigm                   Supervised               Unsupervised          Semi-supervised
Classification                      No                       Yes                   Yes
Term learning                       No                       No                    Yes
Relation learning                   No                       Yes                   Yes
Open and heterogeneous environment  No                       No                    Yes
Table 2 Comparative analysis of the existing ontology-learning-based focused crawlers
3. System Architecture and Components
In this section, we introduce the system architecture and the functionalities of the components of the proposed SOF crawler.
The primary objective of this crawler is to maintain the precision of the ontology-based Web page focused crawling and classification,
by 1) enriching the vocabulary of ontologies, and 2) enabling the crawler itself to work in an uncontrolled Web environment. In order
to realize this objective, we propose a semi-supervised ontology learning approach, enabling the utilized ontology to evolve itself in an
uncontrolled environment, by learning unpredictable but semantically relevant terms extracted from Web pages.
We summarize the four major functions of the proposed crawler as follows: 1) downloading Web pages from the Internet; 2) generating
metadata from Web pages, in which metadata are the semantic descriptions of Web pages; 3) using ontologies to classify relevant
metadata in order to classify relevant Web pages and filter out non-relevant Web pages; and 4) enriching the vocabulary of ontologies
by means of the terms extracted from Web pages. A sketch map of the ontology-based Web page classification is shown in Fig. 1.
Fig. 1 Sketch map of the ontology-based Web page classification
It needs to be noted that this crawler is built upon the semantic focused crawling frameworks designed in our previous research work 6,
7. In our previous research work, we designed two pure semantic focused crawlers, which do not have an ontology-learning function to
automatically evolve the utilized ontologies. This research aims to remedy this defect.
The system architecture and system workflow of the proposed SOF crawler are shown in Fig. 2. Basically, the SOF crawler can be
divided into three components based on the functionalities, i.e., a storage component – the knowledge base, a processing component –
the crawling and processing module, and a computing component – the semi-supervised Web page classification and ontology
learning module. In the rest of this section, we will introduce the technical details regarding the three components.
Fig. 2 System architecture and system workflow of the SOF crawler (components: the knowledge base, comprising an ontology base and a metadata base; the crawling and processing module, comprising the preprocessing, crawling, term extraction, and term processing processes; and the semi-supervised Web page classification and ontology learning module, comprising the direct concept-metadata matching, SVM-based concept-metadata matching, metadata generation and association, ontology learning, and filtering processes)
3.1 General Schemas of Ontology and Metadata in the Knowledge Base
The knowledge base consists of two components – an ontology base and a metadata base. The ontology base is designed with the
purpose of storing formal domain knowledge, i.e. ontologies, for ontology-based Web page filtering and classification. The metadata
base is used to store the semantically annotated information (i.e. metadata) with regard to Web pages. In order to realize the ontology-
based Web page filtering and classification as well as the semantic annotation, we define the general schemas respectively for
ontology and metadata. These two schemas can be customized according to the actual domain knowledge.
For the ontologies stored in the ontology base, it is reasonable to make use of a hierarchical ontology for Web page classification, in
which concepts are linked by the class/subclass relationship. Each concept represents the conceptualization of a specific topic, which
can be associated with semantically relevant Web pages. It needs to be noted that a Web page can be associated with more than one topic.
A subclass of a concept is a subtopic or a more specific topic of the topic represented by the concept. A superclass of a concept is the
upper topic or a more generalized topic of the topic represented by the concept. Therefore, taking into account the features of the
hierarchical ontologies, we can define the general schema of ontological concepts, instead of defining the general schema of a
hierarchical ontology. In addition to the class/subclass property, we define that each concept in a hierarchical ontology contains the
following elementary properties:
A conceptDescription property is a datatype property used to store the textual descriptions of a concept, which consists of one or
more phrases or sentences. Each phrase or sentence is a description or definition of a concept, which is defined by domain
experts. This property will be used in the process of Web page classification.
A learnedConceptDescription property is a datatype property that has a similar purpose to the conceptDescription property. The difference between the two properties is that the former is automatically learned from Web pages by the SOF crawler.
A linkedMetadata property is an object property used to associate a concept with a semantically relevant metadata. This
property is used to semantically index the generated metadata by means of the concepts in an ontology.
For the metadata stored in the metadata base, a metadata is the semantic description of a Web page, which contains the following
elementary properties:
A pageDescription property, which is a datatype property that stores the key terms and term frequencies used to describe the
topics of a Web page. The contents of this property are automatically extracted from the Web page by the SOF crawler, the
process of which will be introduced in Section 3.2. This property will be used for the forthcoming concept-metadata similarity
computation.
A URL property, which is a datatype property used to store the URL of the Web page to which this metadata corresponds.
A linkedConcept property, which is the inverse property of the linkedMetadata property. This property stores the URIs of the
semantically relevant concepts of the metadata. It needs to be noted that the metadata and the concepts can have a many-to-many
relationship.
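A minimal sketch of these two schemas as plain Java value objects is given below. The field names mirror the properties defined above, while the class names and the use of in-memory collections (instead of OWL individuals accessed through the OWL API) are our own simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the general concept schema: textual descriptions defined by experts,
// descriptions learned by the crawler, and links to semantically relevant metadata.
class TopicConcept {
    String uri;
    List<String> conceptDescription = new ArrayList<>();         // expert-defined phrases
    List<String> learnedConceptDescription = new ArrayList<>();  // phrases learned from Web pages
    List<WebPageMetadata> linkedMetadata = new ArrayList<>();    // associated metadata records
}

// Sketch of the general metadata schema: key terms with frequencies, the page URL,
// and back-references to the concepts the page was classified under.
class WebPageMetadata {
    String url;
    Map<String, Integer> pageDescription = new HashMap<>();      // key term -> term frequency
    List<TopicConcept> linkedConcept = new ArrayList<>();        // inverse of linkedMetadata
}
```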
3.2 System Workflow of the Modules
In this section, we introduce the functionalities of the crawling and processing module and the semi-supervised Web page classification and
ontology learning module, in terms of the workflow of the proposed SOF crawler.
The crawling and processing module is designed with the purpose of crawling Web pages and processing the contents of Web pages
and ontologies for forthcoming computation. As can be seen in Fig. 2, the first process in this module is preprocessing, which is to
process the contents of the conceptDescription property of each concept in the ontology for the forthcoming concept-metadata
matching, before the SOF crawler starts crawling over the Internet. This process is realized by using Java WordNet Library3 (JWNL)
to implement tokenization, part-of-speech (POS) tagging, nonsense word filtering, stemming, synonym searching, and term weighting
for the conceptDescription properties of the concepts. For the task of term weighting, each term in the conceptDescription property is
associated with a weight, in order to indicate the particularity of this term in the ontology. Here we make use of the inverse document
frequency (IDF) model (based on the assumption that the less frequently a term occurs, the more particular the term is) for the weight
calculation. The term weight will be used for the forthcoming term processing process (Fig. 4). The algorithmic presentation of this
process is presented in Fig. 3.
Input: Cj are concepts of an ontology O, each concept Cj has a group of concept descr iptions CDjh, and each concept description
CDjhhasagroupoftermsCDjhl.
Output:root,synonyms,andweightofCDjhl–Wjhl.
Procedure:
forallconceptsCjdo
forallconceptsdescriptionsCDjhofaconceptCjdo
foralltermsCDjhlinaconceptdescriptionCDjhdo
RemovepunctuationsinCDjhl;
TokenizeCDjhl;
PerformPOStaggingforCDjhlbyWordNet;
RemovewordswithoutPOStagsinCDjhl;
PerformstemmingforCDjhlbyWordNet;
FindsynonymsofCDjhlfromWordNet;
CDjhl←CDjhl∪synonymsofCDjhl;
Wjhl←log |||∀∈
|||∀∈∩∃∈;
endfor
endfor
endfor
Fig. 3 Procedure of the preprocessing process
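The IDF weighting step of Fig. 3 can be sketched as follows, assuming that tokenization, POS tagging, stemming, and synonym expansion through JWNL have already been applied, so that each concept is represented by the set of its preprocessed description terms; the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the IDF weight computation over the concept descriptions of an ontology:
// W = log(|concepts in O| / |concepts whose descriptions contain the term|).
class ConceptTermWeighting {

    // termsPerConcept: one set of preprocessed description terms per concept in the ontology.
    static Map<String, Double> idfWeights(List<Set<String>> termsPerConcept) {
        int conceptCount = termsPerConcept.size();
        Map<String, Integer> conceptFrequency = new HashMap<>();
        for (Set<String> terms : termsPerConcept) {
            for (String term : terms) {
                conceptFrequency.merge(term, 1, Integer::sum);   // number of concepts containing the term
            }
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : conceptFrequency.entrySet()) {
            weights.put(e.getKey(), Math.log((double) conceptCount / e.getValue()));
        }
        return weights;
    }
}
```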
The second and third processes are crawling and term extraction. The missions of the two processes are to download a Web page from
the Internet at one time, and to extract required information from the downloaded Web page, according to the general metadata
schema defined in Section 3.1, in order to prepare the properties for generating a new metadata. These two processes are realized by
the semantic focused crawlers designed in our previous works 6, 7, in which the extraction rules and the templates are defined by
observing common patterns in the HTML codes. By means of these two processes, nearly all the properties of the metadata are
generated, except for the pageDescription property, which at this stage contains unprocessed key terms.
The next process is term processing, which processes the contents of the pageDescription property of the metadata, in order to
prepare for the forthcoming concept-metadata matching. The implementation of this process is similar to the implementation of the

3 http://sourceforge.net/projects/jwordnet/
preprocessing process. The major differences are that 1) the term processing process does not need the function of synonym retrieval,
due to the provision of this function in the preprocessing process and the consideration of the computing cost; and 2) the term
processing process has a term frequency counting function, which is to count the frequency of the terms in the pageDescription
property. Similarly, the terms in the pageDescription property also need a weight to indicate their particularity. Here a term
matching function is designed for passing the weights of ontological terms obtained in the preprocessing process, in order to reduce
the computing cost in this real-time process. By means of this term matching function, the terms in the pageDescription property are
matched with the terms occurring in the conceptDescription properties of the concepts in an ontology. If two terms are matched, the
associated weight of the matched term in the ontology will be passed to the term in the pageDescription property; otherwise the term
in the pageDescription property will be regarded as a new term and assigned the maximum valid weight for its particularity, i.e., log
(number of concepts in the ontology), in terms of the IDF algorithm. The weights of terms will be used for the following SVM-based
concept-metadata matching process (Section 4.1). The algorithmic expression of the term processing process is shown in Fig. 4.
Input:PDisthepagedescriptionpropertyofaWebpageP,andPDcontainsagroupoftermsPDi.CjareconceptsofanontologyO,each
concepthasagroupofconceptdescriptionsCDjh.EachconceptdescriptionCDjhhasagroupoftermsCDjhl.EachtermCDjhl
isassociatedwithaweightWjhl.
Output:rootsofPDi,termfrequencyofPDi–TFi,termfrequencyofCDjhl–TFjhl,andweightofPDi–Wi.
Procedure:
foralltermsPDiinPdo
RemovepunctuationsinPDi
PerformPOStaggingforPDibyWordNet;
RemovewordswithoutPOStagsinPDi;
PerformstemmingforPDibyWordNet;
TFi←FrequencyofPDi;
foralltermsCDjhlinOdo
ifCDjhlPDithen
Wi←Wjhl;
TFjhl;
endif
endfor
ifWinullthen
Wilog|||∀ ∈ ;
endif
endfor
Fig. 4 Procedure of the term processing process
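A corresponding sketch of the weight-passing step of Fig. 4 is given below; page terms are assumed to be already stemmed, and the ontology term weights are assumed to come from a preprocessing step such as the one sketched above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the term processing step: count page-term frequencies and pass each matched
// term's ontology weight; unmatched (new) terms receive the maximum weight log(|concepts|).
class PageTermProcessing {

    static class WeightedTerm {
        final int frequency;
        final double weight;
        WeightedTerm(int frequency, double weight) { this.frequency = frequency; this.weight = weight; }
    }

    static Map<String, WeightedTerm> process(List<String> pageTerms,
                                             Map<String, Double> ontologyTermWeights,
                                             int conceptCount) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String term : pageTerms) {
            frequencies.merge(term, 1, Integer::sum);            // TF_i of each page term
        }
        double newTermWeight = Math.log(conceptCount);           // weight for terms unseen in the ontology
        Map<String, WeightedTerm> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
            double w = ontologyTermWeights.getOrDefault(e.getKey(), newTermWeight);
            result.put(e.getKey(), new WeightedTerm(e.getValue(), w));
        }
        return result;
    }
}
```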
The remaining part of the workflow constitutes the semi-supervised Web page classification and ontology learning module. The
detailed procedure of this module is described as follows: first of all, the direct string matching process examines whether or not the
content of the pageDescription property of a metadata is included in the conceptDescription and learnedConceptDescription
properties of a concept. If the answer is yes, then the concept and the metadata are regarded as semantically relevant. By means of the
metadata generation and association process, the metadata can then be generated and stored in the metadata base as well as associated
with the concept. If the answer is no, a Support Vector Machine (SVM)-based concept-metadata matching process will be invoked to
check the semantic relatedness between the metadata and the concept, by using a trained SVM model to assess the semantic
relatedness between the pageDescription property of the metadata and the phrases in the conceptDescription property of the concept,
the details of which will be introduced in Section 4. If the pageDescription property of the metadata is semantically relevant to any
phrases in the conceptDescription property of the concept, the metadata and the concept are regarded as semantically relevant, and the
contents of the pageDescription property of the metadata can be regarded as a new phrase for the learnedConceptDescription property
of the concept. The metadata is thus allowed to go through the metadata generation and association process; otherwise the metadata is
regarded as semantically non-relevant to the concept. The above process is repeated until all the concepts in the ontology are
compared to the metadata. If none of the concepts is semantically relevant to the metadata, this metadata is regarded as semantically
non-relevant to the domain represented by the ontology and will be filtered out. The algorithmic expression of the above processes is
revealed in Fig. 5.
Input:Cjareconceptsofanontology,eachconcepthasagroupofconceptdescriptionsCDjhandagroupoflearnedconcept
descriptions LCDjh. Each concept descriptionCDjh hasagroup of terms CDjhl, andeach learned concept description LCDjh
8
has a group of terms LCDj hl. Each term CDjh l is associated with a weight Wjhl. P  is a Web page, P has a pagedescription
propertyPD,andPDcontainsagroupoftermsPDi.EachtermPDiisassociatedwithaweightWi.
Output:1generate a metadata MifPisrelevanttoanyconceptsCj, 2 associate thesemantically relevant concepts CjandM, and 3
updatethelearnedconceptdescriptionsLCDjhifPisnotinCDjhandLCDjh.
Procedure:
forallconceptsCjdo
foralltheconceptsdescriptionsCDjhandthelearnedconceptdescriptionsLCDjhofaconceptCjdo
ifPD≡∃CDjhtrue∩lengthofPDlengthofCDjh∪PD≡∃LCDjhthen
ifMdoesnotexistthen
GenerateametadataM;
endif
AssociatebetweenMandCjbymutuallyreferencingtheirURIs;
break;
else
SimsPD,CDjh←thesimilarityvaluebetweenPDandCDjhbyaSemantic‐basedStringMatchingAlgorithm;
SimpPD,CDjh←thesimilarityvaluebetweenPDandCDjhby a Probability‐based String Matching
Algorithm;
ifSVMSimsPD,CDjh,SimpPD,CDjh1then
ifMdoesnotexistthen
GenerateametadataM;
endif
AssociatebetweenMandCjbymutuallyreferencingtheirURIs;
LCDjh←LCDjh∪PD;//AddthepagedescriptionPDintoLCDjh
break;
endif
endif
endfor
endfor
Fig. 5 Procedure of the semi-supervised Web page classification and ontology learning module
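The control flow of Fig. 5 can be sketched in Java as follows, reusing the value objects sketched in Section 3.1; directMatch, simS, simP, and svmRelevant stand in for the processes described above and are assumed helper operations rather than method names from the actual prototype.

```java
import java.util.List;

// Sketch of the semi-supervised classification loop: for each concept, try a direct string
// match first; otherwise fall back to the SVM over the SSM and PSM similarity values, and
// learn the page description as a new concept description when the SVM says "relevant".
class ClassificationSketch {

    interface Matcher {
        boolean directMatch(WebPageMetadata page, TopicConcept concept);
        double simS(WebPageMetadata page, String conceptDescription);   // semantic-based similarity
        double simP(WebPageMetadata page, String conceptDescription);   // probability-based similarity
        boolean svmRelevant(TopicConcept concept, double simS, double simP);
    }

    static boolean classify(WebPageMetadata page, List<TopicConcept> ontology, Matcher m) {
        boolean relevantToAnyConcept = false;
        for (TopicConcept concept : ontology) {
            if (m.directMatch(page, concept)) {
                associate(page, concept);
                relevantToAnyConcept = true;
                continue;                                   // move on to the next concept
            }
            for (String description : concept.conceptDescription) {
                double sS = m.simS(page, description);
                double sP = m.simP(page, description);
                if (m.svmRelevant(concept, sS, sP)) {
                    associate(page, concept);
                    // learn the page description as a new (learned) concept description
                    concept.learnedConceptDescription.add(String.join(" ", page.pageDescription.keySet()));
                    relevantToAnyConcept = true;
                    break;                                  // one matching description is enough
                }
            }
        }
        return relevantToAnyConcept;                        // false => filter the page out
    }

    private static void associate(WebPageMetadata page, TopicConcept concept) {
        concept.linkedMetadata.add(page);                   // mutual referencing of concept and metadata
        page.linkedConcept.add(concept);
    }
}
```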
Although the SVM-based concept-metadata matching process is a supervised process, the inputs of the SVM model are similarity values between a phrase in the pageDescription property of a metadata and a phrase in the conceptDescription property of a concept (which will be introduced in Section 4); the SVM therefore determines the semantic relatedness between two phrases based on their semantic similarity, regardless of their actual content. The subsequent ontology learning process is thus able to learn uncontrolled new definitions (phrases) that may contain unpredictable new terms, based on the similarity values controlled by the SVM model, and the whole process can be viewed as semi-supervised ontology learning.
4. Support Vector Machine (SVM)-based Concept-Metadata Matching Model
In this section, we introduce the mathematical models utilized in the SVM-based concept-metadata matching process (Fig. 2). In the
semi-supervised Web page classification and ontology learning module, if the descriptive terms extracted from a Web page, i.e., the
pageDescription property of a metadata, cannot directly match with any phrases in the conceptDescription property or
learnedConceptDescription property of a concept, the SVM-based concept-metadata matching process will be invoked to
mathematically examine their semantic relatedness. Fig. 6 indicates the workflow of the SVM-based concept-metadata matching
process. Each concept is associated with a particular SVM classifier, where the inputs of the classifier are the results of a semantic-
based string matching (SSM) algorithm and a probability-based string matching (PSM) algorithm between the concept and a metadata,
and the output of the SVM classifier is their binary semantic relatedness (relevant/non-relevant). From Section 4.1 to 4.3, we will
respectively introduce the SSM algorithm, the PSM algorithm, and the SVM classifier.
Fig. 6 Workflow of the SVM-based concept-metadata matching process
4.1 Semantic-based String Matching Algorithm
The key idea of the SSM algorithm is to measure the text similarity between a phrase in the conceptDescription property (abbreviated
as a concept description) of a concept and the pageDescription property (abbreviated as a page description) of a metadata, by means of
WordNet4 and a semantic similarity model.
A concept description and a page description can be regarded as two groups of terms with weights after the preprocessing and term
processing phase, in which terms in a concept description have their synonyms, and terms in a page description have their term
frequencies in the corresponding Web page. Therefore, we designed a weighted Dice’s coefficient algorithm to measure the semantic
similarity between a concept description and a page description, taking into account the requirements of high precision and short
response time for focused crawling. The mathematical expression of the weighted Dice’s coefficient algorithm is presented as follows:
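A weighted Dice's coefficient over two weighted term sets can be sketched as below, where the similarity is the weight contributed by the shared terms on both sides divided by the total weight of both sets; this is a generic sketch of the coefficient's shape under our own weighting assumptions, not necessarily the exact form of the SSM algorithm.

```java
import java.util.Map;

// Generic sketch of a weighted Dice's coefficient between two weighted term sets:
// sim = (weight of each shared term, counted on both sides) / (total weight of set A + total weight of set B).
class WeightedDice {

    static double similarity(Map<String, Double> conceptTerms, Map<String, Double> pageTerms) {
        double shared = 0.0;
        for (Map.Entry<String, Double> e : conceptTerms.entrySet()) {
            Double pageWeight = pageTerms.get(e.getKey());
            if (pageWeight != null) {
                shared += e.getValue() + pageWeight;        // matched term contributes its weight on both sides
            }
        }
        double totalConcept = conceptTerms.values().stream().mapToDouble(Double::doubleValue).sum();
        double totalPage = pageTerms.values().stream().mapToDouble(Double::doubleValue).sum();
        if (totalConcept + totalPage == 0.0) {
            return 0.0;
        }
        return shared / (totalConcept + totalPage);
    }
}
```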
4.2 Probability-based String Matching Algorithm
The PSM algorithm is a complementary solution for measuring the relevance between concepts and metadata, by measuring the co-
occurrence frequencies of a page description of a metadata and a concept description of a concept in the crawled Web pages, based on
a probabilistic model 22. In the crawling process and the subsequent processes indicated in Fig. 2, the SOF crawler downloads k Web pages at the beginning, and automatically obtains statistical data from these k Web pages, in order to compute the relevance between the page description (PD_i) of a metadata and a concept description (CD_j,h) of a concept (C_j). The PSM algorithm follows an unsupervised training paradigm, which aims at seeking the maximum probability that CD_j,h and PD_i co-occur in the training Web pages. A graphical representation of the PSM algorithm is shown in Fig. 7. The PSM algorithm is mathematically expressed as follows:
$$\mathrm{maxSim}_P(PD_i, CD_{j,h}) = \max_{CD_{j,\theta} \in C_j} \left[ P(CD_{j,\theta} \mid CD_{j,h})\, P(CD_{j,\theta} \mid PD_i) \right] = \max_{CD_{j,\theta} \in C_j} \frac{n_{j,h}^{\theta}}{n_{j,h}} \cdot \frac{n_i^{\theta}}{n_i} \qquad (3)$$

where $CD_{j,\theta}$ is a concept description of $C_j$, $n_{j,h}^{\theta}$ is the number of Web pages that contain both $CD_{j,\theta}$ and $CD_{j,h}$, $n_{j,h}$ is the number of Web pages that contain $CD_{j,h}$, $n_i^{\theta}$ is the number of Web pages that contain both $CD_{j,\theta}$ and $PD_i$, and $n_i$ is the number of Web pages that contain $PD_i$.
Fig. 7. Graphical representation of the PSM algorithm
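Equation (3) reduces to simple ratios of page counts; a minimal sketch in Java is given below, where the count arrays are assumed to have been collected from the k training Web pages and indexed by the concept descriptions CD_{j,θ}.

```java
// Sketch of the PSM score (Equation (3)): for each concept description CD_{j,theta},
// multiply P(CD_{j,theta} | CD_{j,h}) by P(CD_{j,theta} | PD_i) and keep the maximum.
class ProbabilityStringMatching {

    /**
     * @param coOccurWithTarget n^theta_{j,h}: pages containing both CD_{j,theta} and CD_{j,h}
     * @param targetCount       n_{j,h}: pages containing CD_{j,h}
     * @param coOccurWithPage   n^theta_i: pages containing both CD_{j,theta} and PD_i
     * @param pageDescCount     n_i: pages containing PD_i
     */
    static double maxSimP(int[] coOccurWithTarget, int targetCount,
                          int[] coOccurWithPage, int pageDescCount) {
        if (targetCount == 0 || pageDescCount == 0) {
            return 0.0;                                              // no statistics available
        }
        double best = 0.0;
        for (int theta = 0; theta < coOccurWithTarget.length; theta++) {
            double pGivenTarget = (double) coOccurWithTarget[theta] / targetCount;  // P(CD_theta | CD_{j,h})
            double pGivenPage = (double) coOccurWithPage[theta] / pageDescCount;    // P(CD_theta | PD_i)
            best = Math.max(best, pGivenTarget * pGivenPage);
        }
        return best;
    }
}
```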
4.3 SVM Classification Algorithm
The SVM classifier for each concept is designed to best aggregate the results of the SSM algorithm and the PSM algorithm in order to
decide on the semantic relatedness between a concept description and a page description, through a supervised training paradigm. This

4 http://wordnet.princeton.edu/
classifier provides a binary classification function (relevant/non-relevant), which is characterized by a hyperplane in a given feature
space.
Let X = [0, 1]×[0, 1] be the feature space with feature vectors xi = (maxSimS(PDi, CDj,h), maxSimP(PDi, CDj,h)), in which the features
respectively represent the results of the SSM algorithm and the PSM algorithm. The yi value of the training set equals -1 for a pair of
semantically non-relevant concept description and page description, and 1 for the relevant pair. The yi values in the training set are
subjectively defined by domain experts. Eventually, the input of each SVM classifier is a set of training tuples {(x1, y1), …,(xm, ym)}
with $x_i \in X$ and $y_i \in \{-1, 1\}$.
The result of an SVM is a maximum-margin hyperplane, which separates the training examples in the feature space as precisely as possible, while the distance to the closest members on each side is maximized. This is expressed in the following optimization problem:
$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\, w^{T} w + C \sum_{i=1}^{N} \xi_i \qquad (4)$$

$$\text{subject to } \forall i \in \{1, \dots, N\}:\ y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

where w and b describe the optimal hyperplane. The error term $C \sum_{i=1}^{N} \xi_i$ is introduced to allow for outliers in a non-linearly separable training set, where $\xi_i$ is a slack variable and the penalty parameter C controls the trade-off between $\xi_i$ and the size of the margin. $\phi$ is a predefined function which maps features into a higher-dimensional space, and a kernel function is required to reduce the computational load in this process. In this experiment, we employed the radial basis function (RBF) as the kernel, as the number of instances is far larger than the number of features. The RBF kernel function is defined as $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$, $\gamma > 0$. We conducted
the v-fold cross validation and grid-search approach proposed in 23 in order to find the optimal C and γ. The theoretical details of SVM
can be referenced from 24.
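The per-concept classifier can be trained with the libSVM Java library mentioned in Section 5.1. The following is a minimal sketch, assuming each training example is the pair of SSM and PSM scores with a ±1 label; the grid search over C and γ is left out, and the class and method names are our own.

```java
import libsvm.svm;
import libsvm.svm_model;
import libsvm.svm_node;
import libsvm.svm_parameter;
import libsvm.svm_problem;

// Minimal libSVM sketch: train a C-SVC with an RBF kernel on (SSM score, PSM score) features
// labeled +1 (relevant) or -1 (non-relevant), then predict the relatedness of a new pair.
class ConceptSvmSketch {

    static svm_model train(double[][] features, double[] labels, double c, double gamma) {
        svm_problem prob = new svm_problem();
        prob.l = features.length;
        prob.y = labels;
        prob.x = new svm_node[features.length][];
        for (int i = 0; i < features.length; i++) {
            prob.x[i] = toNodes(features[i]);
        }
        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.C = c;                           // penalty parameter from Equation (4)
        param.gamma = gamma;                   // RBF kernel parameter
        param.cache_size = 100;
        param.eps = 1e-3;
        param.shrinking = 1;
        param.probability = 0;
        param.nr_weight = 0;
        param.weight_label = new int[0];
        param.weight = new double[0];
        return svm.svm_train(prob, param);
    }

    static boolean isRelevant(svm_model model, double ssmScore, double psmScore) {
        return svm.svm_predict(model, toNodes(new double[] { ssmScore, psmScore })) > 0;
    }

    private static svm_node[] toNodes(double[] values) {
        svm_node[] nodes = new svm_node[values.length];
        for (int i = 0; i < values.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;            // libSVM feature indices are 1-based
            nodes[i].value = values[i];
        }
        return nodes;
    }
}
```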
5. System Implementation and Evaluation
In this section, in order to systematically evaluate the framework of the proposed SOF crawler, we implement a prototype of this
crawler, and compare the performance of the crawler with the existing work reviewed in Section 2, based on several performance
indicators adopted from the information retrieval (IR) field.
5.1 Prototype Implementation and Test Environment Setup
The overall framework of the SOF crawler is built in Java within the platform of Eclipse 3.7.15. The general ontology schema and the general metadata schema are implemented in OWL-DL within the platform of Protégé 3.4.76. The OWL API7 is utilized to access the OWL file, and the libSVM8 Java library is utilized for the implementation of the SVM classifiers. In order to comparatively analyze our work against the existing work, i.e., Zheng et al.'s and Su et al.'s ontology-learning-based semantic focused crawlers, we implement a prototype for each crawler in Java, in which the ANN model used by Zheng et al.'s crawler is built in
Encog9.
The test environment is initialized by two tasks: (1) the selection of a candidate ontology for ontology-based focused crawling and/or
classification, and ontology learning, and (2) the selection of Web pages for crawling, training, and testing. For the first task, we use a
previously designed mining service ontology, which represents the domain knowledge in the mining service industry. This mining
service ontology follows a four-level hierarchical structure, and consists of 158 concepts, in which each concept is defined by
following the general schema of ontological concepts introduced in Section 3.1. The mining service ontology is mostly referenced
from Wikipedia10, Australian Bureau of Statistics11, and the websites of nearly 200 Australian and international mining service
companies. The details of the mining service ontology can be referenced from 22. For Task 2, as mentioned in Section 2, one common defect of the existing ontology-learning-based focused crawlers is that these crawlers cannot maintain their performance in an

5 http://www.eclipse.org/
6 http://protege.stanford.edu/
7 http://owlapi.sourceforge.net/
8 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
9 http://code.google.com/p/encog-java/
10 http://en.wikipedia.org/
11 http://www.abs.gov.au/
uncontrolled Web environment with unpredictable new terms, due to the limitations of the adopted ontology learning approaches.
Hence, our proposed SOF crawler aims to remedy this defect, by following a semi-supervised Web page classification and ontology
learning paradigm. In order to evaluate our crawler and the existing crawlers in an uncontrolled Web environment, we choose two
mainstream mining service advertising websites – Australian Kompass 12 (abbreviated as Kompass below) and Australian
Yellowpages®13 (abbreviated as Yellowpages® below), as the testing data source. There are around 800 downloadable mining-related
service or product advertisements registered in Kompass, and around 3200 similar advertisements registered in Yellowpages®, all of
which are published in English. Since Zheng et al.'s crawler needs a supervised training process, Su et al.'s crawler needs an
unsupervised training process, and our proposed SOF crawler needs both supervised and unsupervised training processes, we label the
Web pages from Kompass, and use these Web pages as the training set for all of these crawlers. Subsequently, we test and compare
the performance of these crawlers on the task of crawling and classifying the Web pages from Yellowpages®, based on the
performance indicators introduced in the next section, with the purpose of evaluating their capability in this heterogeneous
environment.
5.2 Performance Indicators
We define the following parameters for comparing our crawler with the existing ontology-learning-based focused crawlers.
All the indicators are adopted from the field of IR and need to be redefined in order to be applied in the scenario of ontology-based
focused crawling.
Harvest rate is used to measure the harvesting ability of a crawler. The harvest rate for a crawler ε after crawling μ Web pages is defined as follows:

$$HR_{\varepsilon}(\mu) = \frac{|A_{\mu}|}{|G_{\mu}|} \qquad (5)$$

where $|A_{\mu}|$ is the number of associated metadata from the Web pages, and $|G_{\mu}|$ is the number of generated metadata from the Web pages.
Precision is used to measure the preciseness of a crawler. Precision for a concept Cj after crawling μ Web pages is defined as follows:
$$P_{\mu}(C_j) = \frac{|\{ m_i \mid m_i \in A_j \wedge m_i \in R_j \}|}{|A_j|} \qquad (6)$$

where $A_j$ is the set of associated metadata from the Web pages for C_j, $|A_j|$ is the number of associated metadata from the Web pages for C_j, and $R_j$ is the set of relevant metadata from the Web pages for C_j. It needs to be noted that the set of relevant metadata needs to be manually identified by peers before the evaluation.
Recall is used to measure the effectiveness of a crawler. Recall for a concept Cj after crawling μ Web pages is defined as follows:
$$R_{\mu}(C_j) = \frac{|\{ m_i \mid m_i \in A_j \wedge m_i \in R_j \}|}{|R_j|} \qquad (7)$$

where $|R_j|$ is the number of relevant metadata from the Web pages for C_j.
Harmonic mean is used to measure the aggregated performance of a crawler. Harmonic mean for a concept Cj after crawling μ Web
pages is defined as follows:
$$HM_{\mu}(C_j) = \frac{2\, P_{\mu}(C_j)\, R_{\mu}(C_j)}{P_{\mu}(C_j) + R_{\mu}(C_j)} \qquad (8)$$

12 http://au.kompass.com/
13 http://www.yellowpages.com.au/
Fallout is used to measure the inaccuracy of a crawler. Fallout for a concept Cj after crawling μ Web pages is defined as follows:
$$F_{\mu}(C_j) = \frac{|\{ m_i \mid m_i \in A_j \wedge m_i \in N_j \}|}{|N_j|} \qquad (9)$$

where $N_j$ is the set of non-relevant metadata from the Web pages for C_j, and $|N_j|$ is the number of non-relevant metadata from the Web pages for C_j. It needs to be noted that the set of non-relevant metadata needs to be manually identified by peers before the evaluation.
Crawling time is used to measure the efficiency of a crawler. Crawling time of the SOF crawler for a Web page is defined as the time
interval of processing the Web page from the crawling process to the metadata generation and association process or to the filtering
process, as shown in Fig. 2.
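For clarity, Equations (5)-(9) can be computed directly from the associated, relevant, and non-relevant metadata sets; the sketch below assumes these sets have already been collected for one concept, with the class and method names chosen for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the per-concept indicators of Section 5.2, computed from the set of metadata
// associated by the crawler and the manually identified relevant / non-relevant sets.
class CrawlerMetrics {

    static double harvestRate(int associatedMetadata, int generatedMetadata) {
        return ratio(associatedMetadata, generatedMetadata);                          // Equation (5)
    }

    static <T> double precision(Set<T> associated, Set<T> relevant) {
        return ratio(intersectionSize(associated, relevant), associated.size());      // Equation (6)
    }

    static <T> double recall(Set<T> associated, Set<T> relevant) {
        return ratio(intersectionSize(associated, relevant), relevant.size());        // Equation (7)
    }

    static double harmonicMean(double precision, double recall) {
        return precision + recall == 0.0
                ? 0.0
                : 2 * precision * recall / (precision + recall);                      // Equation (8)
    }

    static <T> double fallout(Set<T> associated, Set<T> nonRelevant) {
        return ratio(intersectionSize(associated, nonRelevant), nonRelevant.size());  // Equation (9)
    }

    private static <T> int intersectionSize(Set<T> a, Set<T> b) {
        Set<T> copy = new HashSet<>(a);
        copy.retainAll(b);
        return copy.size();
    }

    private static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }
}
```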
5.3 System Evaluation
In this section, we evaluate the feasibility of the SOF crawler, by comparing its performance with the existing ontology-learning-based
focused crawlers, i.e., Zheng et al.’s crawler and Su et al.’s crawler introduced in Section 2. We compare the performance of the three
crawlers based on the six parameters introduced in Section 5.2. Since Zheng et al.’s crawler does not have the function of
classification, we only obtain its performance data on harvest rate and crawling time.
Fig. 8. Comparison of the SOF crawler, Su et al.’s crawler, and Zheng et al.’s crawler on harvest rate
Fig. 9. Comparison of the SOF crawler and Su et al.’s crawler on precision
Fig. 10. Comparison of the SOF crawler and Su et al.’s crawler on recall
Fig. 11. Comparison of the SOF crawler and Su et al.’s crawler on Harmonic Mean
Fig. 12. Comparison of the SOF crawler and Su et al.’s crawler on fallout rate
The performance of the SOF crawler, Su et al.'s crawler, and Zheng et al.'s crawler on crawling time is shown in Fig. 13. It can be seen that initially there is no big difference among the three crawlers. Su et al.'s crawler takes the most crawling time overall. Since Zheng et al.'s crawler does not execute the task of classification, it needs less crawling time.
Fig. 13. Comparison of the SOF crawler, Su et al.'s crawler, and Zheng et al.'s crawler on crawling time (unit: ms)
6. Conclusion and Future Works
In this paper, we presented the framework of an innovative semi-supervised ontology-learning-based focused crawler – the SOF
crawler, in order to maintain the performance of ontology-based semantic focused crawling in an open and heterogeneous Web
environment. Due to the limitations of the adopted ontology learning approaches, the existing ontology-learning-based focused crawlers cannot work in an uncontrolled Web environment that contains unpredictable new terms. Hence, in the framework of the SOF crawler, we proposed a semi-supervised ontology learning approach, which enables the utilized ontological concepts to automatically learn new definitions from semantically relevant Web information, while maintaining the crawler's performance in focused
crawling and classification. A semantic-based string matching algorithm and a probability-based string matching algorithm were
designed to measure the semantic relatedness between ontological concepts and Web-information-generated metadata, respectively
from the perspectives of semantic similarity and statistical data. An SVM model was trained to eventually determine the binary relatedness (relevant/non-relevant) between a concept-metadata pair, by aggregating the results from the two algorithms. In
order to evaluate the research outcome, we built the prototypes of the SOF crawler and two existing ontology-learning-based focused
crawlers. Next, we tested and compared their performance based on several IR indicators, in a simulated heterogeneous Web
environment. The comparison results preliminarily prove the feasibility and technical advantages of the proposed SOF crawler.
For the future work, we will focus on the following research tasks in the area of ontology-learning-based focused crawling: 1) we will
try to incorporate other ontology learning approaches into this framework in order to achieve better performance for ontology-based
focused crawling and classification; and 2) we will test the performance of this framework in other domains by developing new topical
ontologies or modifying the existing topical ontologies according to the defined general schema of ontological concepts.
References
1 Batzios, A., Dimou, C., Symeonidis, A.L., and Mitkas, P.A.: 'BioCrawler: An intelligent crawler for the semantic web', Expert Systems with Applications, 2008, 35, (1-2), pp. 524-530
2 Aggarwal, C.C., Al-Garawi, F., and Yu, P.S.: 'Intelligent crawling on the World Wide Web with arbitrary predicates'. Proc. 10th International Conference on World Wide Web (WWW '01), New York, NY, USA, 2001, pp. 96-105
3 Chakrabarti, S., van den Berg, M., and Dom, B.: 'Focused crawling: a new approach to topic-specific Web resource discovery'. Proc. Eighth International Conference on World Wide Web (WWW '99), New York, NY, USA, 1999, pp. 1623-1640
4 Su, C., Gao, Y., Yang, J., and Luo, B.: 'An efficient adaptive focused crawler based on ontology learning'. Proc. Fifth Int. Conf. on Hybrid Intelligent Systems (HIS '05), Rio de Janeiro, Brazil, 6-9 Nov. 2005, pp. 73-78
5 Zheng, H.T., Kang, B.Y., and Kim, H.G.: 'An ontology-based approach to learnable focused crawling', Inform. Sciences, 2008, 178, (23), pp. 4512-4522
6 Dong, H., and Hussain, F.K.: 'Focused crawling for automatic service discovery, annotation, and classification in industrial digital ecosystems', IEEE Trans. Ind. Electron., 2011, 58, (6), pp. 2106-2116
7 Dong, H., Hussain, F.K., and Chang, E.: 'A framework for discovering and classifying ubiquitous services in digital health ecosystems', J. of Comput. and Syst. Sci., 2011, 77, (4), pp. 687-704
8 Ehrig, M., and Maedche, A.: 'Ontology-focused crawling of Web documents'. Proc. Eighteenth Annual ACM Symposium on Applied Computing (SAC 2003), Melbourne, USA, 2003, pp. 9-12
9 Ganesh, S., Jayaraj, M., Kalyan, V., and Aghila, G.: 'Ontology-based Web crawler'. Proc. 2004 International Conference on Information Technology: Coding and Computing (ITCC '04), Las Vegas, USA, 2004, pp. 337-341
10 Halkidi, M., Nguyen, B., Varlamis, I., and Vazirgiannis, M.: 'THESUS: Organizing Web document collections based on link semantics', The VLDB Journal, 2003, 12, (4), pp. 320-332
11 Huang, W., Zhang, L., Zhang, J., and Zhu, M.: 'Semantic focused crawling for retrieving e-commerce information', Journal of Software, 2009, 4, (5), pp. 436-443
12 Yuvarani, M., Iyengar, N.C.S.N., and Kannan, A.: 'LSCrawler: a framework for an enhanced focused Web crawler based on link semantics'. Proc. 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI '06), Hong Kong, 2006, pp. 794-800
13 Toch, E., Gal, A., Reinhartz-Berger, I., and Dori, D.: 'A semantic approach to approximate service retrieval', ACM Transactions on Internet Technology, 2007, 8, (1), pp. 2-31
14 Can, A.B., and Baykal, N.: 'MedicoPort: A medical search engine for all', Computer Methods and Programs in Biomedicine, 2007, 86, (1), pp. 73-86
15 Cesarano, C., d'Acierno, A., and Picariello, A.: 'An intelligent search agent system for semantic information retrieval on the internet'. Proc. Fifth International Workshop on Web Information and Data Management (WIDM '03), New Orleans, USA, 2003, pp. 111-117
16 Batzios, A., Dimou, C., Symeonidis, A.L., and Mitkas, P.A.: 'BioCrawler: An intelligent crawler for the Semantic Web', Expert Systems with Applications, 2008, 35, (1-2), pp. 524-530
17 Liu, H., Milios, E., and Janssen, J.: 'Probabilistic models for focused Web crawling'. Proc. 6th Annual ACM International Workshop on Web Information and Data Management (WIDM '04), Washington, D.C., USA, 2004, pp. 16-22
18 Dong, H., Hussain, F., and Chang, E.: 'State of the art in semantic focused crawlers', in Gervasi, O., Taniar, D., Murgante, B., Lagana, A., Mun, Y., and Gavrilova, M. (Eds.): 'Computational Sci. and Its Applicat. - ICCSA 2009' (Springer Berlin/Heidelberg, 2009), pp. 910-924
19 Gruber, T.R.: 'A translation approach to portable ontology specifications', Knowledge Acquisition, 1993, 5, (2), pp. 199-220
20 Wong, W., Liu, W., and Bennamoun, M.: 'Ontology learning from text: A look back and into the future', ACM Computing Surveys, 2011, to appear
21 Rennie, J., and McCallum, A.: 'Using reinforcement learning to spider the Web efficiently'. Proc. Sixteenth Int. Conf. on Mach. Learning (ICML '99), Bled, Slovenia, 1999, pp. 335-343
22 Dong, H., and Hussain, F.K.: 'Self-adaptive semantic focused crawler for mining services information discovery', IEEE Trans. Ind. Informat., 2012, submitted
23 Hsu, C.W., Chang, C.C., and Lin, C.J.: 'A practical guide to support vector classification' (Department of Computer Science and Information Engineering, National Taiwan University, 2007)
24 Boser, B.E., Guyon, I.M., and Vapnik, V.N.: 'A training algorithm for optimal margin classifiers'. Proc. Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, United States, 1992, pp. 144-152
... Ontology learning has been conducted manually by a limited number of domain experts with the aid of tools like Protégé [30] or OntoEdit [31] in some works [32], [33]. Another set of approaches [34]- [36] use human input to either prepare the input for automatic algorithms or to verify and amend their outputs. The idea of using the knowledge of the crowd (i.e., crowdsourcing) for ontology learning has also been studied in the literature [37]- [39]. ...
Article
Full-text available
The quality of event log data is a constraining factor in achieving reliable insights in process mining. Particular quality problems are posed by activity labels which are meant to be representative of organisational activities, but may take different manifestations (e.g. as a result of manual entry synonyms may be introduced). Ideally, such problems are remedied by domain experts, but they are time-poor and data cleaning is a time-consuming and tedious task. Ontologies provide a means to formalise domain knowledge and their use can provide a scalable solution to fixing activity label similarity problems, as they can be extended and reused over time. Existing approaches to activity label quality improvement use manually-generated ontologies or ontologies that are too general (e.g. WordNet). Limited attention has been paid to facilitating the development of purposeful ontologies in the field of process mining. This paper is concerned with the creation of activity ontologies by domain experts. For the first time in the field of process mining, their participation is facilitated and motivated through the application of techniques from crowdsourcing and gamification. Evaluation of our approach to the construction of activity ontologies by 35 participants shows that they found the method engaging and that its application results in high-quality ontologies.
... Hai Dong et al. [18] designed a semi-supervised ontology learningbased approach for focused web crawling. This work extracts the Resnik semantic similarity score [29] and the statistical-based cooccurrence similarity score between the topic and the web page contents as features. ...
Article
Full-text available
Learning-based focused crawlers download relevant uniform resource locators (URLs) from the web for a specific topic. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as an input feature vector for learning algorithms. TF-IDF-based crawlers calculate the relevance of a web page only if a topic word co-occurs on the said page, failing which it is considered irrelevant. Similarity is not considered even if a synonym of a term co-occurs on a web page. To resolve this challenge, this paper proposes a new methodology that integrates the Adagrad-optimized Skip Gram Negative Sampling (A-SGNS)-based word embedding and the Recurrent Neural Network (RNN).The cosine similarity is calculated from the word embedding matrix to form a feature vector that is given as an input to the RNN to predict the relevance of the website. The performance of the proposed method is evaluated using the harvest rate (hr) and irrelevance ratio (ir). The proposed methodology outperforms existing methodologies with an average harvest rate of 0.42 and irrelevance ratio of 0.58.
... This follows the observation that relevant documents will preferentially link to other relevant documents ("topical locality" [1]). Extensions of this model use ontologies to incorporate semantic knowledge into the matching process [14,15], 'tunnel' between disjoint page clusters [5,35] or learn navigation structures necessary to find relevant pages [12,26]. In time-aware focused crawling [34], the document or event time is used as the primary focusing criterion. ...
Article
Full-text available
Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, currently there are no access methods that help Web archive users to efficiently access event-centric information in large-scale archives that go beyond the retrieval of individual disconnected documents. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate an efficient and intuitive access to information regarding past events. We address this problem by: (1) facilitating users to define event-centric document collections in an intuitive way through a Collection Specification; (2) development of a specialised extraction method that adapts focused crawling techniques to the Web archive settings; and (3) definition of a function to judge the relevance of the archived documents with respect to the Collection Specification taking into account the topical and temporal relevance of the documents. Our extended experiments on the German Web archive (covering a time period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types.
Article
Full-text available
Irrelevant search results for a given topic end up wasting search engine users' time. A learning focused web crawler downloads relevant URLs for a given topic using machine‐learning algorithms. The dynamic nature of the web is a challenge in related computation for focused web crawlers. Studies have shown that the learning focused crawler utilizes term frequency‐inverse document frequency (TF‐IDF) to compute the relevance between a web page and a given topic. The TF‐IDF detects similarity of the given topic to its co‐occurrence on the web page. The necessity of efficient mechanism to compute the relevance of URLs syntactically and semantically has led to the proposal of this paper with a word embedding approach to compute the relevance of the web page. The global vector representation cosine similarity is calculated between a topic and the web page contents. The calculated cosine similarity is provided as input to the trained random forest classifier to predict the relevancy of the web page. The evaluation results proved that the proposed crawler produced an average hrate of 0.41 and prate of 0.59, which outperformed other learning‐focused crawlers on support vector machines, Naive Bayes and artificial neural networks.
Article
Full-text available
Web scraping is a process of extracting valuable and interesting text information from web pages. Most of the current studies targeting this task are mostly about automated web data extraction. In the extraction process, these studies first create a DOM tree and then access the necessary data through this tree. The construction process of this tree increases the time cost depending on the data structure of the DOM Tree. In the current web scraping literature, it is observed that time efficiency is ignored. This study proposes a novel approach, namely UzunExt, which extracts content quickly using the string methods and additional information without creating a DOM Tree. The string methods consist of the following consecutive steps: searching for a given pattern, then calculating the number of closing HTML elements for this pattern, and finally extracting content for the pattern. In the crawling process, our approach collects the additional information, including the starting position for enhancing the searching process, the number of inner tag for improving the extraction process, and tag repetition for terminating the extraction process. The string methods of this novel approach are about 60 times faster than extracting with the DOM-based method. Moreover, using these additional information improves extraction time by 2.35 times compared to using only the string methods. Furthermore, this approach can easily be adapted to other DOM-based studies/parsers in this task to enhance their time efficiencies.
Chapter
In this research we present Mirkwood, a parallel crawler for fast and online syntactic analysis of websites. Configured by default to behave as a focused crawler, analysing exclusively a limited set of hosts, it includes seed extraction capabilities, which allows it to autonomously obtain high quality sites to crawl. Mirkwood is designed to run in a computer cluster, taking advantage of all the cores of its individual machines (virtual or physical), although it can also run on a single machine. By analysing sites online and not downloading the web content, we achieve crawling speeds several orders of magnitude faster than if we did, while assuring that the content we check is up to date. Our crawler relies on MPI, for the cluster of computers, and threading, for each individual machine of the cluster. Our software has been tested in several platforms, including the Supercomputer Calendula. Mirkwood is entirely written in Java language, making it multi–platform and portable.
Conference Paper
Full-text available
Online advertising has become increasingly popular among SMEs in service industries, and thousands of service advertisements are published on the Internet every day. However, there is a huge barrier between service-provider-oriented service information publishing and service-customer-oriented service information discovery, which causes that service consumers hardly retrieve the published service advertising information from the Internet. This issue is partly resulted from the ubiquitous, heterogeneous, and ambiguous service advertising information and the open and shoreless Web environment. The existing research, nevertheless, rarely focuses on this research problem. In this paper, we propose an ontology-learning-based focused crawling approach, enabling Web-crawler-based online service advertising information discovery and classification in the Web environment, by taking into account the characteristics of service advertising information. This approach integrates an ontology-based focused crawling framework, a vocabulary-based ontology learning framework, and a hybrid mathematical model for service advertising information similarity computation.
Article
Full-text available
It is well recognized that the Internet has become the largest marketplace in the world, and online advertising is very popular with numerous industries, including the traditional mining service industry where mining service advertisements are effective carriers of mining service information. However, service users may encounter three major issues – heterogeneity, ubiquity, and ambiguity, when searching for mining service information over the Internet. In this paper, we present the framework of a novel self-adaptive semantic focused crawler – SASF crawler, with the purpose of precisely and efficiently discovering, formatting, and indexing mining service information over the Internet, by taking into account the three major issues. This framework incorporates the technologies of semantic focused crawling and ontology learning, in order to maintain the performance of this crawler, regardless of the variety in the Web environment. The innovations of this research lie in the design of an unsupervised framework for vocabulary-based ontology learning, and a hybrid algorithm for matching semantically relevant concepts and metadata. A series of experiments are conducted in order to evaluate the performance of this crawler. The conclusion and the direction of future work are given in the final section.
Article
Full-text available
Ontologies are often viewed as the answer to the need for interoperable semantics in modern information systems. The explosion of textual information on the Read/Write Web coupled with the increasing demand for ontologies to power the Semantic Web have made (semi-)automatic ontology learning from text a very promising research area. This together with the advanced state in related areas, such as natural language processing, have fueled research into ontology learning over the past decade. This survey looks at how far we have come since the turn of the millennium and discusses the remaining challenges that will define the research directions in this area in the near future.
Article
Full-text available
To support the sharing and reuse of formally represented knowledge among AI systems, it is useful to define the common vocabulary in which shared knowledge is represented. A specification of a representational vocabulary for a shared domain of discourse—definitions of classes, relations, functions, and other objects—is called an ontology. This paper describes a mechanism for defining ontologies that are portable over representation systems. Definitions written in a standard format for predicate calculus are translated by a system called Ontolingua into specialized representations, including frame-based systems as well as relational languages. This allows researchers to share and reuse ontologies, while retaining the computational benefits of specialized implementations.We discuss how the translation approach to portability addresses several technical problems. One problem is how to accommodate the stylistic and organizational differences among representations while preserving declarative content. Another is how to translate from a very expressive language into restricted languages, remaining system-independent while preserving the computational efficiency of implemented systems. We describe how these problems are addressed by basing Ontolingua itself on an ontology of domain-independent, representational idioms.
Article
Full-text available
The requirements for effective search and management of the WWW are stronger than ever. Currently Web documents are classified based on their content not taking into account the fact that these documents are connected to each other by links. We claim that a pages classification is enriched by the detection of its incoming links semantics. This would enable effective browsing and enhance the validity of search results in the WWW context. Another aspect that is underaddressed and strictly related to the tasks of browsing and searching is the similarity of documents at the semantic level. The above observations lead us to the adoption of a hierarchy of concepts (ontology) and a thesaurus to exploit links and provide a better characterization of Web documents. The enhancement of document characterization makes operations such as clustering and labeling very interesting. To this end, we devised a system called THESUS. The system deals with an initial sets of Web documents, extracts keywords from all pages incoming links, and converts them to semantics by mapping them to a domains ontology. Then a clustering algorithm is applied to discover groups of Web documents. The effectiveness of the clustering process is based on the use of a novel similarity measure between documents characterized by sets of terms. Web documents are organized into thematic subsets based on their semantics. The subsets are then labeled, thereby enabling easier management (browsing, searching, querying) of the Web. In this article, we detail the process of this system and give an experimental analysis of its results.
Article
Full-text available
Digital Ecosystems make use of Service Factories for service entities' publishing, classification, and management. However, before the emergence of Digital Ecosystems, there existed ubiquitous and heterogeneous service information in the Business Ecosystems environment. Therefore, dealing with the preexisting service information becomes a crucial issue in Digital Ecosystems. This issue has not been addressed previously in the literature. In order to resolve this issue, in this paper, we present a conceptual framework for a semantic focused crawler, with the purpose of automatically discovering, annotating, and classifying the service information with the Semantic Web technologies. The technical and evaluation details of the framework are also presented and discussed in this paper.
Article
Focused crawling is aimed at selectively seeking out pages that are relevant to a predefined set of topics. Since an ontology is a well-formed knowledge representation, ontology-based focused crawling approaches have come into research. However, since these approaches utilize manually predefined concept weights to calculate the relevance scores of web pages, it is difficult to acquire the optimal concept weights to maintain a stable harvest rate during the crawling process. To address this issue, we proposed a learnable focused crawling framework based on ontology. An ANN (artificial neural network) was constructed using a domain-specific ontology and applied to the classification of web pages. Experimental results show that our approach outperforms the breadth-first search crawling approach, the simple keyword-based crawling approach, the ANN-based focused crawling approach, and the focused crawling approach that uses only a domain-specific ontology.