SOF: A Semi-supervised Ontology-learning-based Focused Crawler 1
Hai Dong^, Farookh Khadeer Hussain*, and Elizabeth Chang^
^School of Information Systems, Curtin Business School, Curtin University of Technology, Perth, WA 6845, Australia
*School of Software, Faculty of Engineering and Information Technology, University of Technology, Sydney, Ultimo, NSW 2007,
Australia
Abstract—The dynamic innovation of Internet technologies drives the fast growth of data volume on the Web, which makes it
increasingly impractical for a crawler to index the whole Web. Instead, many intelligent crawlers, known as ontology-based semantic
focused crawlers, are designed, by making use of Semantic Web technologies for topic-centered Web information crawling.
Ontologies, however, have the constraints of validity and time, which may influence the performance of the crawlers. Ontology
learning-based focused crawlers are therefore designed to automatically evolve ontologies by integrating ontology learning
technologies. Nevertheless, our survey indicates that the existing ontology-learning-based focused crawlers do not have the capability
to automatically enrich the content of ontologies, which makes these crawlers unreliable in the open and heterogeneous Web
environment. Hence, in this paper, we propose the framework of a novel semi-supervised ontology-learning-based focused crawler –
the SOF crawler, which embodies a series of schemas for ontology generation and Web information formatting, a semi-supervised
ontology learning framework, and a hybrid Web page classification approach aggregated by an SVM model. A series of tests is
implemented to evaluate the technical feasibility of the proposed framework. The conclusion and future work are summarized in
the final section.
Keywords—ontology-learning-based-focused crawler, ontological term learning, probabilistic model, semantic focused crawler,
semi-supervised ontology learning, semantic similarity model, support vector machine.
1. Introduction
With broadbandization of networks, popularization of multimedia broadcasting, and fusion of social networks and digital services, we
are living in the era of information explosion. According to a study2 conducted by IDC, in 2011, the amount of information created
and replicated surpassed 1.8 zettabytes (1.8 trillion gigabytes), nine times the amount created five years earlier. It is therefore not
difficult to understand how hard it is to collect required information on the Web. Popular search engines use bots, commonly
referred to as crawlers or spiders, to traverse the Web and index Web information. However, due to the dynamic growth of the
Web, it is becoming increasingly impractical for crawlers to index the whole Web 1. Instead, many intelligent crawlers, known as
topical/focused crawlers, are designed to find Web pages of a particular kind or on a particular topic, by avoiding hyperlinks that lead
to off-topic areas, and by concentrating on links to Web pages of interest 2, 3. Nevertheless, since the topical information used for
focused crawling is described by plain texts, there exists an ambiguity issue in the topical information, as a result of the nature of
natural languages. This issue may further result in ambiguous crawling boundaries and low precision of focused crawling. One
solution to this issue is to apply the background knowledge of crawling topics to focused crawling. An ontology, as a formal
representation of domain-specific knowledge, can be used in focused crawling to semantically define topical boundaries and enhance
the crawling precision. As ontologies are created by domain experts in terms of domain experts’ worldview, in order to represent the
current knowledge in a domain, two questions arise, which are, 1) does an ontology really reflect the knowledge in a domain in the
real world, and 2) how long can an ontology reflect the knowledge in a domain in the real world? With the consideration of the
validity and the time constraint of ontologies, several ontology-learning-based focused crawlers 4, 5 are proposed, in order to evolve
ontologies and maintain their high performance in the focused crawling process. However, our survey in this research found
that the existing methodologies in this area cannot guarantee their performance in an open and heterogeneous Web environment, where
numerous unpredictable new terms emerge in Web pages, since these methodologies do not have the functionality of
ontological term learning.
Motivated by the above limitation, in this paper we propose the framework of a novel ontology-learning-based focused crawler – the SOF crawler,
in order to realize ontological term learning, and high-performance ontology-based focused crawling and Web page
classification, in an open and heterogeneous Web environment. By means of this crawler, crawling topics are represented by
ontological concepts, and metadata are built on Web pages in order to semantically describe their content. In addition, this crawler
1 This is a preprint version of the paper: Dong, H., Hussain, F.K.: SOF: A semi-supervised ontology-learning-based focused crawler. Concurrency and Computation:
Practice and Experience 25(12) (August 2013) pp. 1755-1770. Download link: http://onlinelibrary.wiley.com/doi/10.1002/cpe.2980/abstract
2 http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
contains a semi-supervised ontology learning framework, enabling the continuous enrichment of the definitions of ontological
concepts and thus maintaining its performance in the focused crawling and Web page classification process in an open and
heterogeneous Web environment. In the semi-supervised ontology learning framework, a semantic similarity model and a probabilistic
model are designed to respectively measure the similarity between crawling topics and Web pages from different perspectives. A
Support Vector Machine (SVM) model is eventually trained to aggregate the results of the two models, in order to determine the
semantic relevance between crawling topics and Web pages.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the research areas of semantic focused crawling
and ontology-learning-based focused crawling, and review the previous work in the area of ontology learning-based focused crawling.
In Section 3 we introduce the system architecture and the functionalities of the components of the proposed SOF crawler, including
the general schemas for topical ontology building and Web page metadata generation, and the workflow of the semi-supervised
ontology learning and Web page classification process. In Section 4, we introduce the mathematical models employed in the semi-
supervised ontology learning and Web page classification process. In Section 5, we reveal the prototype implementation details of this
crawler, and evaluate its technical advantages by comparing its performance with the performance of the existing ontology-learning-
based focused crawlers. In Section 6, we summarize the technical features of this SOF crawler and draw our future research directions
in the area of ontology-learning-based focused crawling.
2. Related Work
In this section, we briefly introduce the research areas of semantic focused crawling and ontology-learning-based focused crawling,
and review the previous work in the area of ontology learning-based focused crawling.
A semantic focused crawler is a software agent that is able to traverse the Web, and retrieve as well as download related Web
information for specific topics, by means of semantic Web technologies 6, 7. The goal of semantic focused crawlers is to precisely and
efficiently retrieve and download relevant Web information by understanding the semantics underlying the Web information and the
semantics underlying the predefined topics. The semantic focused crawlers can briefly be classified into two clusters – the ontology-
based semantic focused crawlers and the non-ontology-based semantic focused crawlers, in terms of use of ontologies. The former
refers to the crawlers which make use of ontologies to represent the knowledge underlying topics and Web pages, and link the fetched
Web pages with semantically relevant ontological concepts, with the purpose of focused crawling and Web page classification 8-13.
The latter refers to the crawlers that make use of other Semantic Web technologies for focused crawling and Web page classification
14-17. According to a survey conducted by Dong et al. 18, it is found that most of the crawlers in this domain belong to the first cluster.
However, the limitation of the ontology-based semantic focused crawlers is that their crawling performance crucially depends on the
quality of ontologies. Furthermore, the quality of ontologies may be affected by two issues. The first issue is that, as it is well known
that an ontology is the formal representation of specific domain knowledge 19 and ontologies are designed by domain experts, there
may exist a gap between the domain experts’ understanding of the domain knowledge and the domain knowledge that exists in the real
world. The second issue is that knowledge in the real world is dynamic and persistently evolving, compared with
relatively static ontologies. These two contradictory situations could lead to the problem that ontologies sometimes cannot precisely
represent the real-world knowledge, considering the issues of differentiation and dynamism. The reflection of this problem in the field
of semantic focused crawling is that the ontologies used by semantic focused crawlers cannot precisely represent the knowledge
revealed in Web information, since Web information is mostly created or updated by human users with different knowledge
understandings, and human users are efficient learners of new knowledge. The eventual consequence of this problem could be
reflected in gradually descending performance curves of semantic focused crawlers.
In order to address this defect of ontologies and maintain or enhance the performance of semantic focused crawlers, researchers have
started to pay attention to enhancing semantic focused crawling technologies by integrating them with ontology learning technologies. The
goal of ontology learning is to semi-automatically extract facts or patterns from corpora or data and turn them into machine-
readable ontologies 20. Various techniques have been designed for ontology learning, such as statistics-based techniques, linguistics (or
natural language processing)-based techniques, logic-based techniques, etc. These techniques can also be classified into supervised,
semi-supervised, and unsupervised techniques from the perspective of learning control. Obviously, ontology-learning
techniques can be used to address this issue of semantic focused crawling, by learning new knowledge from crawled
documents and integrating the new knowledge with ontologies in order to persistently refine them.
In the rest of this section, we will review the existing works in the field of ontology learning-based semantic focused crawling. It is
found that few studies have been conducted in this field.
Zheng et al. 5 proposed a supervised ontology-learning-based focused crawler that aims to maintain the harvest rate of the crawler in
the crawling process. The main idea of this crawler is to construct an artificial neural network (ANN) model to determine the
relatedness between a Web page and an ontology. Given a domain-specific ontology and a topic represented by a concept in the
ontology, a set of relevant concepts are selected to represent the background knowledge about the topic, by counting the distance
between the topic concept and the other concepts in the ontology. The crawler then calculates the term frequencies of the relevant
concepts occurring in the visited Web pages. Next, the authors used the backpropagation algorithm to train a three-layer feedforward
ANN, the specification of which is shown in Table 1. The output of the ANN is the relevance score between the topic and a Web page.
The training process follows a supervised paradigm, by which the ANN is trained by labeled Web pages. The training will not stop
until the root mean square error (RMSE) is smaller than 0.01. The limitations of this approach are that, 1) it can only be used to
enhance the harvest rate of crawling but does not have the function of classification; 2) it cannot be used to evolve ontologies by
enriching the vocabulary of ontologies; and 3) the supervised learning may not work within an uncontrolled Web environment with
unpredictable new terms.
Input: frequency of relevant concepts – x_i
1st layer: linear function y_j = Σ_i W_ji x_i
Hidden layer: sigmoid transfer function z_j = 1 / (1 + e^(-y_j))
Output layer: sigmoid transfer function o = 1 / (1 + e^(-z_j))
Notation: x_i (i = 1…n) – a vector; n – number of relevant concepts in an ontology; W_ji (j = 1…4, i = 1…n) – a weight matrix
Table 1 Zheng et al.’s ANN model
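The forward pass of such a network can be sketched as follows. This is an illustrative reconstruction, not Zheng et al.’s implementation: the weight values are invented, and the output unit is assumed to sum the hidden activations before the final sigmoid.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def relevance_score(x, W):
    """Three-layer feedforward pass mirroring Table 1: a linear first
    layer, a sigmoid hidden layer, and a sigmoid output layer."""
    y = [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) for row in W]  # y_j = sum_i W_ji * x_i
    z = [sigmoid(y_j) for y_j in y]                                  # z_j = 1 / (1 + e^-y_j)
    return sigmoid(sum(z))                                           # relevance score in (0, 1)

x = [3.0, 0.0, 1.0, 2.0, 0.0]          # frequencies of 5 relevant concepts in a page
W = [[0.2, -0.1, 0.4, 0.0, 0.3],       # invented 4 x 5 weight matrix (j = 1..4, i = 1..5)
     [0.1, 0.2, -0.3, 0.5, 0.0],
     [-0.2, 0.1, 0.0, 0.3, 0.4],
     [0.3, 0.0, 0.2, -0.1, 0.1]]
score = relevance_score(x, W)
```

In the published approach the weights W would be fitted by backpropagation on labeled Web pages until the RMSE falls below 0.01; the weights here are static placeholders.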
Su et al. 4 proposed an unsupervised ontology-learning-based focused crawler in order to compute the relevance scores between topics
and Web pages. Given a specific domain ontology and a topic represented by a concept in this ontology, the relevance score between a
Web page and the topic is the weighted sum of the occurrence frequencies of all the concepts of the ontology in the Web page. The
original weight of each concept C_k is W^O_{C_k} = 1.00 / n^(d(C_k, t)), where n is a predefined discount factor, and d(C_k, t) is the distance between
the topic concept t and Ck. Next, this crawler makes use of reinforcement learning, which is a probabilistic framework for learning
optimal decision making from reward or punishment 21, in order to train the weight of each concept. The learning step follows an
unsupervised paradigm, which uses the crawler to download a number of Web pages and learn statistics based on these Web pages.
The learning step can be repeated many times. The weight of a concept Ck to a topic t in learning step m is mathematically expressed
as follows:
W^m_{C_k} = W^(m-1)_{C_k} · P(t | C_k) / P(t) = W^(m-1)_{C_k} · P(t, C_k) / (P(C_k) · P(t)) = W^(m-1)_{C_k} · (n^t_k · N_c) / (n_k · N_t) (1)
where n_k is the number of Web pages in which C_k occurs, n^t_k is the number of Web pages in which C_k and t co-occur, N_c is the total
number of Web pages crawled, and N_t is the number of Web pages in which t occurs. Compared with Zheng et al. 5’s approach, this
approach is able to classify Web pages by means of the concepts in an ontology, to learn the weights of relations between concepts,
and to work in an uncontrolled Web environment thanks to the unsupervised learning paradigm. The limitations of Su et al.’s approach
are that 1) it cannot be used to enrich the vocabulary of ontologies; 2) although the unsupervised learning paradigm can work in an
uncontrolled Web environment, it may not work well when numerous new terms emerge or when ontologies have limited vocabulary.
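Su et al.’s weighting scheme can be sketched directly from the two formulas above; the numeric values below are invented for illustration.

```python
def initial_weight(distance, discount):
    """W^O_{C_k} = 1.00 / discount^d(C_k, t): the weight decays with the
    ontology distance between concept C_k and the topic concept t."""
    return 1.00 / (discount ** distance)

def updated_weight(w_prev, n_k_t, n_k, N_t, N_c):
    """One learning step of equation (1):
    W^m = W^(m-1) * P(t|C_k) / P(t) = W^(m-1) * (n_k^t * N_c) / (n_k * N_t)."""
    if n_k == 0 or N_t == 0:
        return w_prev  # no statistics yet; keep the previous weight
    return w_prev * (n_k_t * N_c) / (n_k * N_t)

w0 = initial_weight(distance=2, discount=2.0)   # concept two hops from the topic
# concept co-occurs with the topic in 30 of its 40 pages; topic appears in 50 of 100 crawled pages
w1 = updated_weight(w0, n_k_t=30, n_k=40, N_t=50, N_c=100)
```

A concept that co-occurs with the topic more often than chance predicts (as here) has its weight boosted; one that rarely co-occurs is discounted over successive learning steps.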
By means of a comparative analysis of the two ontology-based focused crawlers (Table 2), we found a common limitation: neither of
the two crawlers is able to truly evolve ontologies by enriching their contents, namely their vocabularies. It is found that
both of the approaches attempt to use learning models to deduce the quantitative relationship between the occurrence frequencies of
the concepts in an ontology and the topic, which may not be applicable in the real Web environment. When numerous unpredictable
new terms outside the scope of the vocabulary of an ontology emerge in Web pages, these approaches cannot determine the
relatedness between the new terms and the topic, and cannot make use of the new terms for the relatedness determination, which could
result in the decline in their performance. Consequently, in order to address this research issue, we propose to design the SOF crawler,
in order to precisely discover, format and index relevant Web pages in the uncontrolled Web environment.
Feature: Zheng et al.’s crawler / Su et al.’s crawler / SOF crawler
Learning paradigm: Supervised / Unsupervised / Semi-supervised
Classification: No / Yes / Yes
Term learning: No / No / Yes
Relation learning: No / Yes / Yes
Open and heterogeneous environment: No / No / Yes
Table 2 Comparative analysis of the existing ontology-learning-based focused crawlers
3. System Architecture and Components
In this section, we introduce the system architecture and the functionalities of the components of the proposed SOF crawler.
The primary objective of this crawler is to maintain the precision of the ontology-based Web page focused crawling and classification,
by 1) enriching the vocabulary of ontologies, and 2) enabling the crawler itself to work in an uncontrolled Web environment. In order
to realize this objective, we propose a semi-supervised ontology learning approach, enabling the utilized ontology to evolve itself in an
uncontrolled environment, by learning unpredictable but semantically relevant terms extracted from Web pages.
We summarize the four major functions of the proposed crawler as follows: 1) downloading Web pages from the Internet; 2) generating
metadata from Web pages, in which metadata are the semantic descriptions of Web pages; 3) using ontologies to classify relevant
metadata in order to classify relevant Web pages and filter out non-relevant Web pages; and 4) enriching the vocabulary of ontologies
by means of the terms extracted from Web pages. A sketch map of the ontology-based Web page classification is shown in Fig. 1.
Fig. 1 Sketch map of the ontology-based Web page classification
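The four functions above can be sketched as a top-level loop. This is an illustrative sketch only: all callables (downloader, extract_terms, classify, learn) are hypothetical stand-ins for the SOF components, and hyperlink extraction into the frontier is omitted for brevity.

```python
def sof_crawl(seed_urls, ontology, downloader, extract_terms, classify, learn):
    """Top-level loop over the four functions of Section 3 (sketch)."""
    frontier = list(seed_urls)
    metadata_base = []
    while frontier:
        url = frontier.pop(0)
        page = downloader(url)                                   # 1) download a Web page
        metadata = {"URL": url,
                    "pageDescription": extract_terms(page)}      # 2) metadata generation
        concepts = classify(metadata, ontology)                  # 3) ontology-based classification
        if concepts:                                             # relevant: index and learn
            metadata["linkedConcept"] = concepts
            metadata_base.append(metadata)
            learn(ontology, metadata)                            # 4) enrich the ontology vocabulary
        # non-relevant pages are filtered out
    return metadata_base

# toy usage with trivial stand-ins
pages = {"http://example.com/a": "road freight transport"}
result = sof_crawl(
    ["http://example.com/a"], {"RoadFreight": []},
    downloader=pages.get,
    extract_terms=lambda p: p.split(),
    classify=lambda m, o: ["RoadFreight"] if "freight" in m["pageDescription"] else [],
    learn=lambda o, m: o["RoadFreight"].append(" ".join(m["pageDescription"])))
```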
It needs to be noted that this crawler is built upon the semantic focused crawling frameworks designed in our previous research work 6,
7. In our previous research work, we designed two pure semantic focused crawlers, which do not have an ontology-learning function to
automatically evolve the utilized ontologies. This research aims to remedy this defect.
The system architecture and system workflow of the proposed SOF crawler are shown in Fig. 2. Basically, the SOF crawler can be
divided into three components based on the functionalities, i.e., a storage component – the knowledge base, a processing component –
the crawling and processing module, and a computing component – the semi-supervised Web page classification and ontology
learning module. In the rest of this section, we will introduce the technical details regarding the three components.
[Fig. 2 depicts the workflow: the crawling and processing module (crawling, term extraction, preprocessing, term processing) feeds direct concept-metadata matching; unmatched metadata pass to SVM-based concept-metadata matching and then to metadata generation and association, ontology learning, or filtering; the knowledge base comprises the ontology base and the metadata base.]
Fig. 2 System architecture and system workflow of the SOF crawler
3.1 General Schemas of Ontology and Metadata in the Knowledge Base
The knowledge base consists of two components – an ontology base and a metadata base. The ontology base is designed with the
purpose of storing formal domain knowledge, i.e. ontologies, for ontology-based Web page filtering and classification. The metadata
base is used to store the semantically annotated information (i.e. metadata) with regard to Web pages. In order to realize the ontology-
based Web page filtering and classification as well as the semantic annotation, we define the general schemas respectively for
ontology and metadata. These two schemas can be customized according to the actual domain knowledge.
For the ontologies stored in the ontology base, it is reasonable to make use of a hierarchical ontology for Web page classification, in
which concepts are linked by the class/subclass relationship. Each concept represents the conceptualization of a specific topic, which
can be associated with semantically relevant Web pages. It needs to be noted that a Web page can be associated with more than one topic.
A subclass of a concept is a subtopic or a more specific topic of the topic represented by the concept. A superclass of a concept is the
upper topic or a more generalized topic of the topic represented by the concept. Therefore, taking into account the features of the
hierarchical ontologies, we can define the general schema of ontological concepts, instead of defining the general schema of a
hierarchical ontology. In addition to the class/subclass property, we define that each concept in a hierarchical ontology contains the
following elementary properties:
A conceptDescription property is a datatype property used to store the textual descriptions of a concept, which consists of one or
more phrases or sentences. Each phrase or sentence is a description or definition of a concept, which is defined by domain
experts. This property will be used in the process of Web page classification.
A learnedConceptDescription property is a datatype property that serves a similar purpose as the conceptDescription property.
The difference between the two properties is that the former is automatically learned from Web pages by the SOF crawler.
A linkedMetadata property is an object property used to associate a concept with a semantically relevant metadata record. This
property is used to semantically index the generated metadata by means of the concepts in an ontology.
For the metadata stored in the metadata base, a metadata record is the semantic description of a Web page, which contains the following
elementary properties:
A pageDescription property, which is a datatype property that stores the key terms and term frequencies used to describe the
topics of a Web page. The contents of this property are automatically extracted from the Web page by the SOF crawler, the
process of which will be introduced in Section 3.2. This property will be used for the forthcoming concept-metadata similarity
computation.
A URL property, which is a datatype property used to store the URL of the Web page to which this metadata corresponds.
A linkedConcept property, which is the inverse property of the linkedMetadata property. This property stores the URIs of the
semantically relevant concepts of the metadata. It needs to be noted that the metadata and the concepts can have a many-to-many
relationship.
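The two schemas can be illustrated as plain data structures. This is a minimal sketch: the class and attribute names are Python-style renderings of the properties above, not the actual OWL/RDF serialization used by the crawler.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """One concept of the hierarchical topical ontology (Section 3.1)."""
    uri: str
    concept_description: list = field(default_factory=list)          # expert-written phrases
    learned_concept_description: list = field(default_factory=list)  # phrases learned by SOF
    linked_metadata: list = field(default_factory=list)              # URIs of relevant metadata

@dataclass
class Metadata:
    """Semantic description of one crawled Web page."""
    uri: str
    url: str                                                  # the page this record describes
    page_description: dict = field(default_factory=dict)      # term -> frequency
    linked_concepts: list = field(default_factory=list)       # inverse of linkedMetadata

c = Concept(uri="ex:TransportService",
            concept_description=["freight transport of goods by road"])
m = Metadata(uri="ex:meta1", url="http://example.com/page1",
             page_description={"freight": 3, "road": 2})
# many-to-many linkage by mutual URI references
c.linked_metadata.append(m.uri)
m.linked_concepts.append(c.uri)
```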
3.2 System Workflow of the Modules
In this section, we introduce the functionalities of the crawling and processing module and the semi-supervised Web page classification and
ontology learning module, in terms of the workflow of the proposed SOF crawler.
The crawling and processing module is designed with the purpose of crawling Web pages and processing the contents of Web pages
and ontologies for forthcoming computation. As can be seen in Fig. 2, the first process in this module is preprocessing, which is to
process the contents of the conceptDescription property of each concept in the ontology for the forthcoming concept-metadata
matching, before the SOF crawler starts crawling over the Internet. This process is realized by using Java WordNet Library3 (JWNL)
to implement tokenization, part-of-speech (POS) tagging, nonsense word filtering, stemming, synonym searching, and term weighting
for the conceptDescription properties of the concepts. For the task of term weighting, each term in the conceptDescription property is
associated with a weight, in order to indicate the particularity of this term in the ontology. Here we make use of the inverse document
frequency (IDF) model (based on the assumption that the less frequently a term occurs, the more particular it is) for the weight
calculation. The term weight will be used for the forthcoming term processing process (Fig. 4). The algorithmic presentation of this
process is presented in Fig. 3.
Input: C_j are concepts of an ontology O; each concept C_j has a group of concept descriptions CD_jh, and each concept description
CD_jh has a group of terms CD_jhl.
Output: root, synonyms, and weight of CD_jhl – W_jhl.
Procedure:
for all concepts C_j do
  for all concept descriptions CD_jh of a concept C_j do
    for all terms CD_jhl in a concept description CD_jh do
      Remove punctuation in CD_jhl;
      Tokenize CD_jhl;
      Perform POS tagging for CD_jhl by WordNet;
      Remove words without POS tags in CD_jhl;
      Perform stemming for CD_jhl by WordNet;
      Find synonyms of CD_jhl from WordNet;
      CD_jhl ← CD_jhl ∪ synonyms of CD_jhl;
      W_jhl ← log ( |{C_j | ∀C_j ∈ O}| / |{C_j | ∀C_j ∈ O ∧ ∃CD_jhl ∈ C_j}| );
    end for
  end for
end for
Fig. 3 Procedure of the preprocessing process
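The IDF weighting step of the preprocessing procedure can be sketched as follows. This is an illustrative Python rendering that omits the WordNet-based POS tagging, stemming, and synonym expansion; the stopword list and the example ontology are invented for demonstration.

```python
import math
import string

STOPWORDS = {"a", "an", "the", "of", "for", "and", "or", "to", "in", "by"}

def preprocess(description):
    """Tokenize a concept description, dropping punctuation and stopwords
    (the paper additionally applies JWNL/WordNet processing here)."""
    text = description.lower().translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t not in STOPWORDS]

def idf_weights(concept_descriptions):
    """W_jhl = log(|concepts| / |concepts whose descriptions contain the term|)."""
    n = len(concept_descriptions)
    terms_per_concept = [set(t for d in descs for t in preprocess(d))
                         for descs in concept_descriptions.values()]
    vocab = set().union(*terms_per_concept)
    return {term: math.log(n / sum(term in ts for ts in terms_per_concept))
            for term in vocab}

ontology = {
    "RoadFreight": ["freight transport by road"],
    "RailFreight": ["freight transport by rail"],
    "Taxi": ["passenger transport by taxi"],
}
w = idf_weights(ontology)
```

A term occurring in every concept ("transport") gets weight 0, while a term unique to one concept ("taxi") gets the highest weight, matching the assumption that rarer terms are more particular.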
The second and third processes are crawling and term extraction. The missions of these two processes are to download Web pages from
the Internet one at a time, and to extract the required information from each downloaded Web page, according to the general metadata
schema defined in Section 3.1, in order to prepare the properties for generating a new metadata record. These two processes are realized by
the semantic focused crawlers designed in our previous works 6, 7, in which the extraction rules and the templates are defined by
observing common patterns in the HTML code. By means of these two processes, nearly all the properties of the metadata are
generated, except for the pageDescription property, which contains unprocessed key terms.
The fourth process is term processing, which is to process the contents of the pageDescription property of the metadata, in order to
prepare for the forthcoming concept-metadata matching. The implementation of this process is similar to the implementation of the
3 http://sourceforge.net/projects/jwordnet/
preprocessing process. The major differences are that 1) the term processing process does not need the function of synonym retrieval,
due to the provision of this function in the preprocessing process and the consideration of the computing cost; and 2) the term
processing process has a term frequency counting function, which is to count the frequency of the terms in the pageDescription
property. Similarly, the terms in the pageDescription property also need a weight indicating their particularity. Here a term
matching function is designed to pass on the weights of the ontological terms obtained in the preprocessing process, in order to reduce
the computing cost of this real-time process. By means of this term matching function, the terms in the pageDescription property are
matched with the terms occurring in the conceptDescription properties of the concepts in an ontology. If two terms match, the
associated weight of the matched term in the ontology is passed to the term in the pageDescription property; otherwise the term
in the pageDescription property is regarded as a new term and assigned the maximum valid weight for its particularity, i.e., log
(number of concepts in the ontology), in terms of the IDF algorithm. The weights of terms will be used in the following SVM-based
concept-metadata matching process (Section 4.1). The algorithmic expression of the term processing process is shown in Fig. 4.
Input: PD is the pageDescription property of a Web page P, and PD contains a group of terms PD_i. C_j are concepts of an ontology O;
each concept has a group of concept descriptions CD_jh. Each concept description CD_jh has a group of terms CD_jhl. Each term CD_jhl
is associated with a weight W_jhl.
Output: roots of PD_i, term frequency of PD_i – TF_i, term frequency of CD_jhl – TF_jhl, and weight of PD_i – W_i.
Procedure:
for all terms PD_i in P do
  Remove punctuation in PD_i;
  Perform POS tagging for PD_i by WordNet;
  Remove words without POS tags in PD_i;
  Perform stemming for PD_i by WordNet;
  TF_i ← frequency of PD_i;
  for all terms CD_jhl in O do
    if CD_jhl ≡ PD_i then
      W_i ← W_jhl;
      TF_jhl ← TF_i;
    end if
  end for
  if W_i = null then
    W_i ← log ( |{C_j | ∀C_j ∈ O}| );
  end if
end for
Fig. 4 Procedure of the term processing process
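The weight-passing logic of the term processing process can be sketched as follows; the function and variable names are illustrative, and the ontology term weights are toy values carried over from a hypothetical preprocessing run.

```python
import math

def process_page_terms(page_terms, ontology_term_weights, num_concepts):
    """Build the pageDescription contents: term frequencies plus weights.
    Known ontology terms inherit their IDF weight from preprocessing;
    unseen terms get the maximum valid weight log(|concepts|)."""
    tf = {}
    for t in page_terms:
        tf[t] = tf.get(t, 0) + 1                 # term frequency counting
    max_weight = math.log(num_concepts)          # weight for brand-new terms
    weights = {t: ontology_term_weights.get(t, max_weight) for t in tf}
    return tf, weights

ontology_term_weights = {"freight": 0.405, "road": 1.099}  # from preprocessing (toy values)
tf, w = process_page_terms(
    ["freight", "road", "freight", "drone"],               # stemmed terms from one page
    ontology_term_weights, num_concepts=3)
```

The unseen term "drone" is treated as maximally particular (weight log 3), so new vocabulary is never silently dropped; this is exactly what enables the later ontology learning step.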
The rest of the workflow can be integrated as a semi-supervised Web page classification and ontology learning module. The
detailed procedure of this module is described as follows: first of all, the direct string matching process examines whether or not the
content of the pageDescription property of a metadata is included in the conceptDescription and learnedConceptDescription
properties of a concept. If the answer is yes, then the concept and the metadata are regarded as semantically relevant. By means of the
metadata generation and association process, the metadata can then be generated and stored in the metadata base as well as associated
with the concept. If the answer is no, a Support Vector Machine (SVM)-based concept-metadata matching process will be invoked to
check the semantic relatedness between the metadata and the concept, by using a trained SVM model to assess the semantic
relatedness between the pageDescription property of the metadata and the phrases in the conceptDescription property of the concept,
the details of which will be introduced in Section 4. If the pageDescription property of the metadata is semantically relevant to any
phrases in the conceptDescription property of the concept, the metadata and the concept are regarded as semantically relevant, and the
contents of the pageDescription property of the metadata can be regarded as a new phrase for the learnedConceptDescription property
of the concept. The metadata is thus allowed to go through the metadata generation and association process; otherwise the metadata is
regarded as semantically non-relevant to the concept. The above process is repeated until all the concepts in the ontology are
compared to the metadata. If none of the concepts is semantically relevant to the metadata, this metadata is regarded as semantically
non-relevant to the domain represented by the ontology and will be filtered out. The algorithmic expression of the above processes is
revealed in Fig. 5.
Input: C_j are concepts of an ontology; each concept has a group of concept descriptions CD_jh and a group of learned concept
descriptions LCD_jh. Each concept description CD_jh has a group of terms CD_jhl, and each learned concept description LCD_jh
has a group of terms LCD_jhl. Each term CD_jhl is associated with a weight W_jhl. P is a Web page; P has a pageDescription
property PD, and PD contains a group of terms PD_i. Each term PD_i is associated with a weight W_i.
Output: (1) generate a metadata M if P is relevant to any concept C_j, (2) associate the semantically relevant concepts C_j and M, and (3)
update the learned concept descriptions LCD_jh if PD is not in CD_jh and LCD_jh.
Procedure:
for all concepts C_j do
  for all the concept descriptions CD_jh and the learned concept descriptions LCD_jh of a concept C_j do
    if (PD ≡ ∃CD_jh is true ∩ length of PD = length of CD_jh) ∪ (PD ≡ ∃LCD_jh) then
      if M does not exist then
        Generate a metadata M;
      end if
      Associate M and C_j by mutually referencing their URIs;
      break;
    else
      Sim_s(PD, CD_jh) ← the similarity value between PD and CD_jh by the Semantic-based String Matching algorithm;
      Sim_p(PD, CD_jh) ← the similarity value between PD and CD_jh by the Probability-based String Matching algorithm;
      if SVM(Sim_s(PD, CD_jh), Sim_p(PD, CD_jh)) = 1 then
        if M does not exist then
          Generate a metadata M;
        end if
        Associate M and C_j by mutually referencing their URIs;
        LCD_jh ← LCD_jh ∪ PD; // add the page description PD into LCD_jh
        break;
      end if
    end if
  end for
end for
Fig. 5 Procedure of the semi-supervised Web page classification and ontology learning module
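The module’s control flow can be sketched as follows. This is an illustrative reduction: sim_s and sim_p are naive stand-ins for the SSM and PSM algorithms of Section 4, and the trained SVM is replaced by a simple thresholding lambda.

```python
def sim_s(a, b):
    """Stand-in for the semantic-based string matching score (Section 4.1)."""
    ta, tb = set(a.split()), set(b.split())
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def sim_p(a, b):
    """Stand-in for the probability-based string matching score (Section 4.2)."""
    return sim_s(a, b)

def classify(metadata, concepts, svm_predict):
    """Sketch of the Fig. 5 loop: direct matching first, then SVM-based
    matching; relevant page descriptions become learned definitions."""
    relevant = []
    for c in concepts:
        known = c["conceptDescription"] + c["learnedConceptDescription"]
        if metadata["pageDescription"] in known:            # direct matching
            relevant.append(c["uri"])
            continue
        for d in c["conceptDescription"]:
            if svm_predict(sim_s(metadata["pageDescription"], d),
                           sim_p(metadata["pageDescription"], d)) == 1:
                relevant.append(c["uri"])
                # ontology learning: keep the page description as a new
                # learned definition of the concept
                c["learnedConceptDescription"].append(metadata["pageDescription"])
                break
    return relevant  # empty list -> metadata filtered out as off-topic

concepts = [{"uri": "ex:RoadFreight",
             "conceptDescription": ["freight transport by road"],
             "learnedConceptDescription": []}]
meta = {"pageDescription": "road freight transport services"}
found = classify(meta, concepts,
                 lambda s, p: 1 if (s + p) / 2 > 0.5 else 0)
```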
Although the SVM-based concept-metadata matching process is a supervised process, the inputs of the SVM model are
similarity values between a phrase in the pageDescription property of a metadata and a phrase in the conceptDescription property of a
concept (which will be introduced in Section 4), so the SVM determines the semantic relatedness between two phrases based on their
semantic similarity, regardless of their actual content. The subsequent ontology learning process is therefore able to learn
uncontrolled new definitions (phrases) that may contain unpredictable new terms, based on the similarity values controlled by the
SVM model, and is thus viewed as a semi-supervised ontology learning process.
4. Support Vector Machine (SVM)-based Concept-Metadata Matching Model
In this section, we introduce the mathematical models utilized in the SVM-based concept-metadata matching process (Fig. 2). In the
semi-supervised Web page classification and ontology learning module, if the descriptive terms extracted from a Web page, i.e., the
pageDescription property of a metadata, cannot directly match with any phrases in the conceptDescription property or
learnedConceptDescription property of a concept, the SVM-based concept-metadata matching process will be invoked to
mathematically examine their semantic relatedness. Fig. 6 indicates the workflow of the SVM-based concept-metadata matching
process. Each concept is associated with a particular SVM classifier, where the inputs of the classifier are the results of a semantic-
based string matching (SSM) algorithm and a probability-based string matching (PSM) algorithm between the concept and a metadata,
and the output of the SVM classifier is their binary semantic relatedness (relevant/non-relevant). Sections 4.1 to 4.3 introduce the SSM algorithm, the PSM algorithm, and the SVM classifier, respectively.
Fig. 6 Workflow of the SVM-based concept-metadata matching process
4.1 Semantic-based String Matching Algorithm
The key idea of the SSM algorithm is to measure the text similarity between a phrase in the conceptDescription property (abbreviated
as a concept description) of a concept and the pageDescription property (abbreviated as a page description) of a metadata, by means of
WordNet⁴ and a semantic similarity model.
A concept description and a page description can be regarded as two groups of terms with weights after the preprocessing and term
processing phase, in which terms in a concept description have their synonyms, and terms in a page description have their term
frequencies in the corresponding Web page. Therefore, we designed a weighted Dice’s coefficient algorithm to measure the semantic
similarity between a concept description and a page description, taking into account the requirements of high precision and short
response time for focused crawling. The mathematical expression of the weighted Dice’s coefficient algorithm is presented as follows:
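As a generic sketch (the notation here is illustrative rather than the exact weighting used by the SOF crawler), a weighted Dice's coefficient over a concept description CD and a page description PD can take the form:

```latex
\operatorname{Sim}_S(CD, PD) =
  \frac{2 \sum_{(t_c,\, t_p) \in M} \min\!\big(w_{CD}(t_c),\; w_{PD}(t_p)\big)}
       {\sum_{t_c \in CD} w_{CD}(t_c) \;+\; \sum_{t_p \in PD} w_{PD}(t_p)}
```

where M is the set of term pairs matched directly or through WordNet synonyms, and w_CD and w_PD denote the term weights (synonym-based weights and term frequencies, respectively).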
4.2 Probability-based String Matching Algorithm
The PSM algorithm is a complementary solution for measuring the relevance between concepts and metadata. It measures the co-occurrence frequencies of a page description of a metadata and a concept description of a concept in the crawled Web pages, based on a probabilistic model [22]. In the crawling process and the subsequent processes indicated in Fig. 2, the SOF crawler downloads k Web pages at the beginning, and automatically obtains statistical data from these k Web pages, in order to compute the relevance between the page description (PDi) of a metadata and a concept description (CDj,h) of a concept (Cj). The PSM algorithm follows an unsupervised training paradigm, which seeks the maximum probability that CDj,h and PDi co-occur in the trained Web pages. A graphical representation of the PSM algorithm is shown in Fig. 7. The PSM algorithm is mathematically expressed as follows:
maxSim_P(PD_i, CD_{j,h}) = max_{CD_{j,θ}∈C_j} [P(CD_{j,θ}|CD_{j,h}) P(CD_{j,θ}|PD_i)] = max_{CD_{j,θ}∈C_j} (n_{j,θ,h}/n_{j,h}) (n_{j,θ,i}/n_i)   (3)

where CD_{j,θ} is a concept description of C_j, n_{j,θ,h} is the number of Web pages that contain both CD_{j,θ} and CD_{j,h}, n_{j,h} is the number of Web pages that contain CD_{j,h}, n_{j,θ,i} is the number of Web pages that contain both CD_{j,θ} and PD_i, and n_i is the number of Web pages that contain PD_i.
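Given the co-occurrence counts defined above, the PSM maximization can be sketched as follows. The count arrays are hypothetical stand-ins for the statistics the crawler gathers from the k training pages, not its actual data structures:

```java
// Minimal sketch of the PSM maximization over the concept descriptions
// CD_{j,theta} of a concept, using hypothetical co-occurrence counts gathered
// from the k training pages.
public class PsmSketch {

    /**
     * @param coOccurWithCd n_{j,theta,h}: pages containing both CD_{j,theta} and CD_{j,h}
     * @param nCd           n_{j,h}: pages containing CD_{j,h}
     * @param coOccurWithPd n_{j,theta,i}: pages containing both CD_{j,theta} and PD_i
     * @param nPd           n_i: pages containing PD_i
     * @return max over theta of (n_{j,theta,h}/n_{j,h}) * (n_{j,theta,i}/n_i)
     */
    static double maxSimP(int[] coOccurWithCd, int nCd, int[] coOccurWithPd, int nPd) {
        double best = 0.0;
        for (int theta = 0; theta < coOccurWithCd.length; theta++) {
            double p = ((double) coOccurWithCd[theta] / nCd)
                     * ((double) coOccurWithPd[theta] / nPd);
            best = Math.max(best, p);
        }
        return best;
    }

    public static void main(String[] args) {
        // Two candidate concept descriptions CD_{j,theta}; the counts are made up.
        double sim = maxSimP(new int[]{8, 2}, 10, new int[]{6, 9}, 12);
        System.out.println(sim); // (8/10)*(6/12) = 0.4 beats (2/10)*(9/12) = 0.15
    }
}
```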
Fig. 7. Graphical representation of the PSM algorithm
4.3 SVM Classification Algorithm
The SVM classifier for each concept is designed to best aggregate the results of the SSM algorithm and the PSM algorithm in order to
decide on the semantic relatedness between a concept description and a page description, through a supervised training paradigm. This classifier provides a binary classification function (relevant/non-relevant), which is characterized by a hyperplane in a given feature space.
4 http://wordnet.princeton.edu/
Let X = [0, 1] × [0, 1] be the feature space with feature vectors x_i = (maxSim_S(PD_i, CD_{j,h}), maxSim_P(PD_i, CD_{j,h})), in which the features respectively represent the results of the SSM algorithm and the PSM algorithm. The y_i value of the training set equals −1 for a semantically non-relevant pair of concept description and page description, and 1 for a relevant pair. The y_i values in the training set are subjectively defined by domain experts. Eventually, the input of each SVM classifier is a set of training tuples {(x_1, y_1), …, (x_m, y_m)} with x_i ∈ X and y_i ∈ {−1, 1}.
The result of an SVM is a maximum-margin hyperplane, which separates the training examples in the feature space as precisely as possible, while the distance to the closest members on each side is maximized. This is expressed in the following optimization problem:

minimize_{w,b,ξ}: (1/2) wᵀw + C Σ_{i=1}^{N} ξ_i   (4)

subject to ∀i ∈ 1,…,N: y_i(wᵀφ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0,

where w and b describe the optimal hyperplane. The error term C Σ_{i=1}^{N} ξ_i is introduced to allow for outliers in a non-linearly separable training set, where ξ_i is a slack variable and the penalty parameter C controls the trade-off between the ξ_i and the size of the margin. φ is a predefined function which maps features into a higher-dimensional space, and a kernel function is required to reduce the computational load in this process. In this experiment, we employed the radial basis function (RBF) as the kernel, as the number of instances is far larger than the number of features. The RBF kernel function is defined as K(x_i, x_j) = e^{−γ‖x_i − x_j‖²}, γ > 0. We conducted the v-fold cross-validation and grid-search approach proposed in [23] in order to find the optimal C and γ. The theoretical details of SVM can be referenced from [24].
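Since each feature vector here is only two-dimensional, the RBF kernel is cheap to evaluate. The stand-alone helper below illustrates the computation; the prototype itself delegates kernel evaluation to libSVM, so this code is illustrative only:

```java
// Stand-alone RBF kernel for 2-D feature vectors x = (maxSimS, maxSimP).
// Illustrative only; the SOF prototype delegates this to libSVM.
public class RbfKernel {

    /** K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), gamma > 0. */
    static double k(double[] xi, double[] xj, double gamma) {
        double sq = 0.0;
        for (int d = 0; d < xi.length; d++) {
            double diff = xi[d] - xj[d];
            sq += diff * diff;   // accumulate squared Euclidean distance
        }
        return Math.exp(-gamma * sq);
    }

    public static void main(String[] args) {
        double[] a = {0.9, 0.4};   // an (SSM, PSM) similarity pair
        double[] b = {0.9, 0.4};
        System.out.println(k(a, b, 0.5)); // identical vectors give 1.0
    }
}
```

Larger γ makes the kernel decay faster with distance, which is precisely what the grid search over C and γ tunes.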
5. System Implementation and Evaluation
In this section, in order to systematically evaluate the framework of the proposed SOF crawler, we implement a prototype of this
crawler, and compare the performance of the crawler with the existing work reviewed in Section 2, based on several performance
indicators adopted from the information retrieval (IR) field.
5.1 Prototype Implementation and Test Environment Setup
The overall framework of the SOF crawler is built in Java within the platform of Eclipse 3.7.1⁵. The general ontology schema and general metadata schema are implemented in OWL-DL within the platform of Protégé 3.4.7⁶. The OWL API⁷ is utilized to access the OWL file and the libSVM⁸ Java library is utilized for the implementation of the SVM classifiers. For the purpose of comparatively analyzing our work against the existing work, i.e., Zheng et al.'s and Su et al.'s ontology-learning-based semantic focused crawlers, we implement a prototype for each crawler in Java, in which the ANN model used by Zheng et al.'s crawler is built in Encog⁹.
The test environment is initialized by two tasks: (1) the selection of a candidate ontology for ontology-based focused crawling and/or
classification, and ontology learning, and (2) the selection of Web pages for crawling, training, and testing. For the first task, we use a
previously designed mining service ontology, which represents the domain knowledge in the mining service industry. This mining
service ontology follows a four-level hierarchical structure, and consists of 158 concepts, in which each concept is defined by
following the general schema of ontological concepts introduced in Section 3.1. The mining service ontology is mostly referenced from Wikipedia¹⁰, the Australian Bureau of Statistics¹¹, and the websites of nearly 200 Australian and international mining service companies. The details of the mining service ontology can be referenced from [22]. For the second task, as mentioned in Section 2, one common
defect of the existing ontology-learning-based focused crawlers is that these crawlers cannot keep their performance in an
5 http://www.eclipse.org/
6 http://protege.stanford.edu/
7 http://owlapi.sourceforge.net/
8 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
9 http://code.google.com/p/encog-java/
10 http://en.wikipedia.org/
11 http://www.abs.gov.au/
uncontrolled Web environment with unpredictable new terms, due to the limitations of the adopted ontology learning approaches.
Hence, our proposed SOF crawler aims to remedy this defect, by following a semi-supervised Web page classification and ontology
learning paradigm. In order to evaluate our crawler and the existing crawlers in an uncontrolled Web environment, we choose two
mainstream mining service advertising websites – Australian Kompass¹² (abbreviated as Kompass below) and Australian Yellowpages®¹³ (abbreviated as Yellowpages® below) – as the testing data source. There are around 800 downloadable mining-related service or product advertisements registered in Kompass, and around 3200 similar advertisements registered in Yellowpages®, all of which are published in English. Since Zheng et al.'s crawler needs a supervised training process, Su et al.'s crawler needs an
unsupervised training process, and our proposed SOF crawler needs both supervised and unsupervised training processes, we label the
Web pages from Kompass, and use these Web pages as the training set for all of these crawlers. Subsequently, we test and compare
the performance of these crawlers on the task of crawling and classifying the Web pages from Yellowpages®, based on the
performance indicators introduced in the next section, with the purpose of evaluating their capability in this heterogeneous
environment.
5.2 Performance Indicators
We define the following parameters for comparing our crawler with the existing ontology-learning-based focused crawlers.
All the indicators are adopted from the field of IR and need to be redefined in order to be applied in the scenario of ontology-based
focused crawling.
Harvest rate is used to measure the harvesting ability of a crawler. Harvest rate for a crawler ε after crawling μ Web pages is defined as follows:

HR(ε, μ) = |A_μ| / |G_μ|   (5)

where |A_μ| is the number of associated metadata from the μ Web pages, and |G_μ| is the number of generated metadata from the μ Web pages.
Precision is used to measure the preciseness of a crawler. Precision for a concept Cj after crawling μ Web pages is defined as follows:

P_μ(C_j) = |{m_i | m_i ∈ A_{j,μ} ∧ m_i ∈ R_{j,μ}}| / |A_{j,μ}|   (6)

where A_{j,μ} is the set of associated metadata from the μ Web pages for C_j, |A_{j,μ}| is the number of associated metadata from the μ Web pages for C_j, and R_{j,μ} is the set of relevant metadata from the μ Web pages for C_j. It needs to be noted that the set of relevant metadata needs to be manually identified by peers before the evaluation.
Recall is used to measure the effectiveness of a crawler. Recall for a concept Cj after crawling μ Web pages is defined as follows:

R_μ(C_j) = |{m_i | m_i ∈ A_{j,μ} ∧ m_i ∈ R_{j,μ}}| / |R_{j,μ}|   (7)

where |R_{j,μ}| is the number of relevant metadata from the μ Web pages for C_j.
Harmonic mean is used to measure the aggregated performance of a crawler. Harmonic mean for a concept Cj after crawling μ Web pages is defined as follows:

HM_μ(C_j) = 2 P_μ(C_j) R_μ(C_j) / (P_μ(C_j) + R_μ(C_j))   (8)
12 http://au.kompass.com/
13 http://www.yellowpages.com.au/
Fallout is used to measure the inaccuracy of a crawler. Fallout for a concept Cj after crawling μ Web pages is defined as follows:

F_μ(C_j) = |{m_i | m_i ∈ A_{j,μ} ∧ m_i ∈ N_{j,μ}}| / |N_{j,μ}|   (9)

where N_{j,μ} is the set of non-relevant metadata from the μ Web pages for C_j, and |N_{j,μ}| is the number of non-relevant metadata from the μ Web pages for C_j. It needs to be noted that the set of non-relevant metadata needs to be manually identified by peers before the evaluation.
Crawling time is used to measure the efficiency of a crawler. Crawling time of the SOF crawler for a Web page is defined as the time
interval of processing the Web page from the crawling process to the metadata generation and association process or to the filtering
process, as shown in Fig. 2.
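The set-based indicators above reduce to simple set operations over the crawler's outputs and the manual relevance judgments. The sketch below illustrates them with hypothetical metadata identifiers; the judgments are assumed to be given:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative computation of the set-based indicators of Section 5.2.
// "associated"/"generated" count the metadata produced by the crawler;
// "relevant"/"nonRelevant" are the manual judgments for a concept Cj.
public class CrawlerMetrics {

    static double harvestRate(int associated, int generated) {
        return (double) associated / generated;                 // eq. (5)
    }

    static double precision(Set<String> associated, Set<String> relevant) {
        Set<String> hit = new HashSet<>(associated);
        hit.retainAll(relevant);                                // associated AND relevant
        return (double) hit.size() / associated.size();         // eq. (6)
    }

    static double recall(Set<String> associated, Set<String> relevant) {
        Set<String> hit = new HashSet<>(associated);
        hit.retainAll(relevant);
        return (double) hit.size() / relevant.size();           // eq. (7)
    }

    static double harmonicMean(double p, double r) {
        return 2 * p * r / (p + r);                             // eq. (8)
    }

    static double fallout(Set<String> associated, Set<String> nonRelevant) {
        Set<String> miss = new HashSet<>(associated);
        miss.retainAll(nonRelevant);                            // associated AND non-relevant
        return (double) miss.size() / nonRelevant.size();       // eq. (9)
    }

    public static void main(String[] args) {
        Set<String> assoc = Set.of("m1", "m2", "m3");
        Set<String> rel = Set.of("m1", "m2", "m4");
        Set<String> nonRel = Set.of("m3", "m5");
        double p = precision(assoc, rel);            // 2/3
        double r = recall(assoc, rel);               // 2/3
        System.out.println(harmonicMean(p, r));      // 2/3
        System.out.println(fallout(assoc, nonRel));  // 0.5
    }
}
```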
5.3 System Evaluation
In this section, we evaluate the feasibility of the SOF crawler, by comparing its performance with the existing ontology-learning-based
focused crawlers, i.e., Zheng et al.’s crawler and Su et al.’s crawler introduced in Section 2. We compare the performance of the three
crawlers based on the six parameters introduced in Section 5.2. Since Zheng et al.’s crawler does not have the function of
classification, we only obtain its performance data on harvest rate and crawling time.
Fig. 8. Comparison of the SOF crawler, Su et al.’s crawler, and Zheng et al.’s crawler on harvest rate
Fig. 9. Comparison of the SOF crawler and Su et al.’s crawler on precision
Fig. 10. Comparison of the SOF crawler and Su et al.’s crawler on recall
Fig. 11. Comparison of the SOF crawler and Su et al.’s crawler on Harmonic Mean
Fig. 12. Comparison of the SOF crawler and Su et al.’s crawler on fallout rate
The performance of the SOF crawler, Su et al.'s crawler, and Zheng et al.'s crawler on crawling time is shown in Fig. 13. It can be seen that initially there is no big difference among the three crawlers. Su et al.'s crawler takes the most crawling time. Since Zheng et al.'s crawler does not execute the task of classification, it needs less crawling time.
Fig. 13. Comparison of the SOF crawler, Su et al.’s crawler, and Zheng et al.’s crawler on crawling time
6. Conclusion and Future Works
In this paper, we presented the framework of an innovative semi-supervised ontology-learning-based focused crawler – the SOF
crawler, in order to maintain the performance of ontology-based semantic focused crawling in an open and heterogeneous Web
environment. Due to the limitations in the adopted ontology learning approaches, the existing ontology-learning-based focused
crawlers cannot work in an uncontrolled Web environment that contains unpredictable new terms. Hence, in the framework of the
SOF crawler, we proposed a semi-supervised ontology learning approach, which enables the utilized ontological concepts to
automatically learn new definitions from the semantically relevant Web information, while keeping its performance in focused
crawling and classification. A semantic-based string matching algorithm and a probability-based string matching algorithm were
designed to measure the semantic relatedness between ontological concepts and Web-information-generated metadata, respectively
from the perspectives of semantic similarity and statistical data. An SVM model was trained to eventually determine the binary relatedness (relevant/non-relevant) between a concept-metadata pair, by aggregating the results from the two algorithms. In
order to evaluate the research outcome, we built the prototypes of the SOF crawler and two existing ontology-learning-based focused
crawlers. Next, we tested and compared their performance based on several IR indicators, in a simulated heterogeneous Web
environment. The comparison results preliminarily prove the feasibility and technical advantages of the proposed SOF crawler.
For future work, we will focus on the following research tasks in the area of ontology-learning-based focused crawling: 1) we will
try to incorporate other ontology learning approaches into this framework in order to achieve better performance for ontology-based
focused crawling and classification; and 2) we will test the performance of this framework in other domains by developing new topical
ontologies or modifying the existing topical ontologies according to the defined general schema of ontological concepts.
References
[1] Batzios, A., Dimou, C., Symeonidis, A.L., and Mitkas, P.A.: 'BioCrawler: An intelligent crawler for the semantic web', Expert Systems with Applications, 2008, 35, (1-2), pp. 524-530
[2] Aggarwal, C.C., Al-Garawi, F., and Yu, P.S.: 'Intelligent crawling on the World Wide Web with arbitrary predicates'. Proc. 10th International Conference on World Wide Web (WWW '01), New York, NY, USA, 2001, pp. 96-105
[3] Chakrabarti, S., Berg, M.v.d., and Dom, B.: 'Focused crawling: a new approach to topic-specific Web resource discovery'. Proc. Eighth International Conference on World Wide Web (WWW '99), New York, NY, USA, 1999, pp. 1623-1640
[4] Su, C., Gao, Y., Yang, J., and Luo, B.: 'An efficient adaptive focused crawler based on ontology learning'. Proc. Fifth Int. Conf. on Hybrid Intelligent Syst. (HIS '05), Rio de Janeiro, Brazil, 6-9 Nov. 2005, pp. 73-78
[5] Zheng, H.-T., Kang, B.-Y., and Kim, H.-G.: 'An ontology-based approach to learnable focused crawling', Inform. Sciences, 2008, 178, (23), pp. 4512-4522
[6] Dong, H., and Hussain, F.K.: 'Focused crawling for automatic service discovery, annotation, and classification in industrial digital ecosystems', IEEE Trans. Ind. Electron., 2011, 58, (6), pp. 2106-2116
[7] Dong, H., Hussain, F.K., and Chang, E.: 'A framework for discovering and classifying ubiquitous services in digital health ecosystems', J. of Comput. and Syst. Sci., 2011, 77, (4), pp. 687-704
[8] Ehrig, M., and Maedche, A.: 'Ontology-focused crawling of Web documents'. Proc. Eighteenth Annual ACM Symposium on Applied Computing (SAC 2003), Melbourne, USA, 2003, pp. 9-12
[9] Ganesh, S., Jayaraj, M., Kalyan, V., and Aghila, G.: 'Ontology-based Web crawler'. Proc. 2004 International Conference on Information Technology: Coding and Computing (ITCC '04), Las Vegas, USA, 2004, pp. 337-341
[10] Halkidi, M., Nguyen, B., Varlamis, I., and Vazirgiannis, M.: 'THESUS: Organizing Web document collections based on link semantics', The VLDB Journal, 2003, 12, (4), pp. 320-332
[11] Huang, W., Zhang, L., Zhang, J., and Zhu, M.: 'Semantic focused crawling for retrieving e-commerce information', Journal of Software, 2009, 4, (5), pp. 436-443
[12] Yuvarani, M., Iyengar, N.C.S.N., and Kannan, A.: 'LSCrawler: a framework for an enhanced focused Web crawler based on link semantics'. Proc. 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI '06), Hong Kong, 2006, pp. 794-800
[13] Toch, E., Gal, A., Reinhartz-Berger, I., and Dori, D.: 'A semantic approach to approximate service retrieval', ACM Transactions on Internet Technology, 2007, 8, (1), pp. 2-31
[14] Can, A.B., and Baykal, N.: 'MedicoPort: A medical search engine for all', Computer Methods and Programs in Biomedicine, 2007, 86, (1), pp. 73-86
[15] Cesarano, C., d'Acierno, A., and Picariello, A.: 'An intelligent search agent system for semantic information retrieval on the internet'. Proc. Fifth International Workshop on Web Information and Data Management (WIDM '03), New Orleans, USA, 2003, pp. 111-117
[16] Batzios, A., Dimou, C., Symeonidis, A.L., and Mitkas, P.A.: 'BioCrawler: An intelligent crawler for the Semantic Web', Expert Systems with Applications, 2008, 35, (1-2), pp. 524-530
[17] Liu, H., Milios, E., and Janssen, J.: 'Probabilistic models for focused Web crawling'. Proc. 6th Annual ACM International Workshop on Web Information and Data Management (WIDM '04), Washington D.C., USA, 2004, pp. 16-22
[18] Dong, H., Hussain, F., and Chang, E.: 'State of the art in semantic focused crawlers', in Gervasi, O., Taniar, D., Murgante, B., Lagana, A., Mun, Y., and Gavrilova, M. (Eds.): 'Computational Sci. and Its Applicat. - ICCSA 2009' (Springer Berlin/Heidelberg, 2009), pp. 910-924
[19] Gruber, T.R.: 'A translation approach to portable ontology specifications', Knowledge Acquisition, 1993, 5, (2), pp. 199-220
[20] Wong, W., Liu, W., and Bennamoun, M.: 'Ontology learning from text: A look back and into the future', ACM Computing Surveys, 2011, to appear
[21] Rennie, J., and McCallum, A.: 'Using reinforcement learning to spider the Web efficiently'. Proc. Sixteenth Int. Conf. on Mach. Learning (ICML '99), Bled, Slovenia, 1999, pp. 335-343
[22] Dong, H., and Hussain, F.K.: 'Self-adaptive semantic focused crawler for mining services information discovery', IEEE Trans. Ind. Informat., 2012, submitted
[23] Hsu, C.-W., Chang, C.-C., and Lin, C.-J.: 'A practical guide to support vector classification', Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2007
[24] Boser, B.E., Guyon, I.M., and Vapnik, V.N.: 'A training algorithm for optimal margin classifiers'. Proc. Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, United States, 1992, pp. 144-152