Focused Web Crawler
Ayoub Mohamed H. Elyasir1, Kalaiarasi Sonai Muthu Anbananthen2
Multimedia University, Melaka, Malaysia
1Email: ayoub_it@msn.com, 2Email: kalaiarasi@mmu.edu.my
Abstract. A web crawler is a program that traverses the internet in an automated manner to download web pages, or partial page contents, according to the user's requirements. Only a small amount of work on web crawlers has been published, for various reasons. Web crawling and its techniques remain in the shadow and hold many secrets because of their role in giant search engine applications, whose operators tend to obscure them as the secret recipe for their success. Secrecy also protects against search spamming and the gaming of ranking functions, so web crawling methods are rarely published or publicly announced. Yet the web crawler is an important and fragile component of many applications, including business competitive intelligence, advertising, marketing and internet usage statistics. In this work, we compare the two main types of web crawlers, standard and focused, in order to choose one of them and apply it in our later framework for opinion mining in the education domain.
Keywords: Web Crawling, Focused Crawler, Search Engine, Uniform Resource Locator, Canonicalization
1. Introduction
Over the last decade, the World Wide Web has evolved from a modest number of pages to billions of diverse objects. In order to harvest this enormous data repository, search engines download parts of the existing web and offer Internet users access to this database through keyword search. One of the main components of a search engine is the web crawler. A web crawler is a web service that assists users in their web navigation by automating the task of link traversal, creating a searchable index of the web, and fulfilling searchers' queries from the index. That is, a web crawler automatically discovers and collects resources in an orderly fashion from the internet according to the user's requirements. Researchers and programmers use different terms to refer to web crawlers, such as aggregators, agents and intelligent agents, spiders (by analogy with how spiders traverse their webs), or robots (because the crawler traverses the web in an automated manner).
To date, various applications of web crawlers have been introduced and developed to perform particular objectives. Some of these applications are malicious, in that they penetrate users' privacy by collecting information without permission. Nevertheless, web crawlers power applications with significant market impact, as they are central to search engines, business competitive intelligence and internet usage statistics. Unfortunately, web crawling remains in the shadow and holds many secrets because of its role in giant search engine applications, whose operators tend to obscure it as the secret recipe for their success. Secrecy also protects against search spamming and the gaming of ranking functions, so web crawling methods are rarely published or publicly announced.
2. Web Crawling
Searching is the most prominent activity on the web; internet users look into various topics and interests every time they surf it. Web crawling is the technical machinery behind internet search, which the giant search engines nowadays provide to users at no cost. No client-side components beyond the browser are needed; a web search service consists of two main logistical parts: crawling, the process of
finding documents and constructing the index; and serving, the process of receiving queries from searchers
and using the index to determine the relevant results.
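To make the crawling/serving split concrete, the following minimal sketch (our illustration, not any engine's implementation) builds a toy inverted index from fetched pages and answers keyword queries from it; the seed URL, page limit and helper names are assumptions introduced only for this example.

```python
# Minimal sketch of the crawling/serving split: crawl() fetches pages and
# builds an inverted index, serve() answers keyword queries from that index.
# Seed URL and limits are illustrative assumptions.
import re
from collections import defaultdict

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=10):
    """Crawling: find documents and construct the index."""
    index = defaultdict(set)              # term -> set of URLs containing it
    frontier, seen = list(seed_urls), set()
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)             # breadth-first order
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        for term in re.findall(r"[a-z]+", soup.get_text().lower()):
            index[term].add(url)
        frontier.extend(a["href"] for a in soup.find_all("a", href=True)
                        if a["href"].startswith("http"))
    return index

def serve(index, query):
    """Serving: use the index to determine the relevant results for a query."""
    results = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*results) if results else set()

if __name__ == "__main__":
    idx = crawl(["https://example.com"])  # illustrative seed
    print(serve(idx, "example domain"))
```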
Web crawling is the means by which a crawler collects pages from the Web. The result of crawling is a collection of Web pages at a central or distributed location. Given the continuous expansion of the Web, this crawled collection is guaranteed to be a subset of the Web and, indeed, may be far smaller than the total size of the Web. By design, a web crawler therefore aims for a small, manageable collection that is representative of the entire Web.
Web crawlers may differ from each other in the way they crawl web pages, which mainly depends on the final application that the crawling system will serve. Based on their functionality, crawlers are classified into standard and focused. A standard crawler collects web pages in a random fashion, while a focused crawler follows a guided traversal process. Figure 1 below shows that a standard crawler branches broadly through the nodes (web pages) regardless of the node's domain, while a focused crawler traverses deeper and more narrowly toward a specific node domain. Another remark on Figure 1 is that the starting node (root) is the same for both the standard and the focused crawler.
A focused crawler would ideally download only web pages that are relevant to a particular topic and avoid downloading all others. It predicts the probability that a link leads to a relevant page before actually downloading that page; a possible predictor is the anchor text of the link. In another approach, the relevance of a page is determined after downloading its content: relevant pages are sent to content indexing and the URLs they contain are added to the crawl frontier, while pages that fall below a relevance threshold are discarded. A minimal sketch of this loop is given after Figure 1.
Figure 1: Standard versus Focused Crawler
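The loop described above can be sketched as a best-first crawl over a priority frontier, where anchor text predicts a link's relevance before download and fetched pages below a threshold are discarded. The topic keywords, scoring functions and threshold below are illustrative assumptions, not the method of any cited work.

```python
# Hedged sketch of a focused crawler: a best-first frontier ordered by a simple
# anchor-text relevance predictor. Topic keywords, threshold and helper names
# are assumptions for illustration only.
import heapq

import requests
from bs4 import BeautifulSoup

TOPIC_KEYWORDS = {"education", "learning", "university", "course"}  # assumed topic
RELEVANCE_THRESHOLD = 0.25                                          # assumed cutoff

def predict_relevance(anchor_text):
    """Predict link relevance from anchor text before downloading the page."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_KEYWORDS) / max(len(words), 1)

def page_relevance(text):
    """Score a downloaded page by the fraction of topic keywords it mentions."""
    words = set(text.lower().split())
    return len(words & TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)

def focused_crawl(seed_url, max_pages=20):
    frontier = [(-1.0, seed_url)]          # max-heap via negated priority
    seen, collected = set(), []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        if page_relevance(soup.get_text()) < RELEVANCE_THRESHOLD:
            continue                        # discard off-topic page
        collected.append(url)               # "send to content indexing"
        for a in soup.find_all("a", href=True):
            link = a["href"]
            if link.startswith("http") and link not in seen:
                heapq.heappush(frontier, (-predict_relevance(a.get_text()), link))
    return collected
```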
2.1 Comparison
Table 1 below shows the difference between standard and focused web crawlers:
Table 1: Comparison between standard and focused web crawlers

| Criterion | Standard Web Crawler | Focused Web Crawler |
| --- | --- | --- |
| Synonym | No-selection web crawler | Topical web crawler (a topical crawler may refer to a focused crawler that does not use a classifier but a simple guided technique instead) |
| Introduced by | Various contributions | Chakrabarti, Berg and Dom (1999) |
| Definition | Traverses the internet in an automated, pre-defined manner, collecting web pages at random | Traverses the web in the same way as a standard crawler, but collects only pages similar to each other based on the domain, application, inserted query, etc. |
| Path searching | Random search; may lose its way while traversing the web | Narrowed search path with steady performance |
| Web pages | Not necessarily related or linked to each other | Must be related to particular criteria |
| Starting seed | Root seed | Root seed, with dependency on the web search engine to provide the starting point |
| Ending seed | Some random seed | Relevant to the traversed seeds |
| Robustness | Prone to URL distortions | Robust against distortions because it follows a relevant URL path |
| Discovery | Wide radius but less relevant web pages | Narrow radius with relevant web pages |
| Resource consumption | Lower resource consumption because of its basic path-traversal algorithms | High resource usage, especially with distributed focused crawlers that run on multiple workstations |
| Page weight | Assigns a value to each web page for priority purposes | Assigns a value to each web page for priority and relevance (credit) purposes |
| Performance dependency | Crawling is independent | Crawling depends on the link richness within a specific domain |
| Flexibility | Customizable with many options | Less flexible due to its domain dependency |
| Classifier | No classification involved; relies heavily on traditional graph algorithms such as depth-first or breadth-first traversal | Classifies pages as relevant or not relevant using Naïve Bayes, decision trees, breadth-first, neural networks or Support Vector Machines (SVM), the last of which outperforms the other methods, especially when applied to page contents and link context |
| Overall | Lower resource consumption and performance | Higher resource consumption and performance, with high-quality collections of web pages |
From the comparison, we find that the focused crawler is the better choice for traversing the internet. The ability to narrow the search radius along a specific, guided path gives the focused crawler higher-quality web page collections, since it attempts to identify the most related links and skips the off-topic ones. A malformed URL that sends the crawl in a false direction easily distorts a standard crawler, because it follows every link using a breadth-first algorithm and downloads them all along the way. Resource consumption is lower in standard crawling, but the focused crawler remains the better choice, given that computing resources are available today at reasonable prices. The focused crawler is not as customizable as the standard crawler, but it has the ability to classify results based on page contents and link context. Additionally, commercial applications prefer focused crawlers because of their domain dependency and restriction, where some crawl by topic and others crawl by region or location.
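As a hedged illustration of the classification ability mentioned above, the sketch below trains a Naïve Bayes model on a few labelled snippets that combine page content with link context (anchor text); the training examples and labels are invented purely for demonstration.

```python
# Illustrative sketch (not from the paper): classifying pages as relevant or
# not using Naïve Bayes over page content plus link context. The tiny training
# set below is invented for demonstration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each training example concatenates page text with the anchor text (link
# context) that led to it; labels: 1 = relevant to education, 0 = off-topic.
train_texts = [
    "university course catalogue lecture schedule | anchor: online courses",
    "student feedback on teaching quality and curriculum | anchor: education review",
    "buy cheap flights hotel deals travel packages | anchor: holiday offers",
    "celebrity gossip movie trailers box office | anchor: entertainment news",
]
train_labels = [1, 1, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

new_page = "distance learning degree programmes | anchor: university admissions"
print(classifier.predict([new_page]))        # likely [1] with this toy data
print(classifier.predict_proba([new_page]))  # class probabilities
```

The same pipeline could swap in a decision tree or an SVM; the point is only that both page content and link context feed the relevance decision.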
3. Review on Focused Web Crawler
Search engine web sites are the most visited sites on the internet worldwide because of their importance in our daily lives. The web crawler is the dominant function or module across the entire World Wide Web (WWW), as it is the heart of any search engine. The standard crawler is a powerful technique for traversing the web, but it is noisy in terms of resource usage on both client and server. Thus, most researchers focus on architectures and algorithms that are able to collect the pages most relevant to the corresponding topic of interest. The term focused crawling, which denotes the crawling of topic-specific web pages, was originally introduced by Chakrabarti, Berg and Dom (1999). In order to save hardware and network resources, a focused web crawler analyzes the crawled pages to find the links that are likely to be most relevant for the crawl and ignores the irrelevant clusters of the web.
Chakrabarti, Berg and Dom (1999) described a focused web crawler with three components: a classifier to evaluate a web page's relevance to the chosen topic, a distiller to identify relevant nodes within a few link layers, and a reconfigurable crawler that is governed by the classifier and distiller. They impose various features on the designed classifier and distiller: exploring links in terms of their sociology, extracting specific web pages for a given query, and mining communities (training) to improve the crawling ability, yielding high-quality collections with fewer irrelevant web pages.
The web page credit problem was addressed by Diligenti, Coetzee, Lawrence, Giles and Gori (2000), in which crawl paths are chosen based on the number of pages and their values. They use a context graph to capture the link hierarchies within which valuable pages occur and exploit reverse crawling capabilities for a more exhaustive search. They also concluded that focused crawling is the future replacement of standard crawling as long as large machine resources are available.
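The context-graph idea can be roughly sketched as follows: documents known to lie 0, 1, 2, ... links away from target pages (collected in practice by reverse crawling through a search engine's backlink facility) train a classifier that predicts the layer of a newly seen document, and the predicted distance sets its crawl priority. The toy layer data and priority rule below are assumptions for illustration, not the authors' implementation.

```python
# Rough, illustrative sketch of the context-graph idea: predict how many links
# away a page lies from a target page and prioritise closer-looking pages.
# The toy layer data and priority rule are assumptions, not the 2000 paper's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Layer 0 = target pages themselves, layer 1 = pages linking to targets,
# layer 2 = pages two links away (in practice gathered by reverse crawling).
layer_texts = [
    "course syllabus exam timetable lecturer office hours",      # layer 0
    "department homepage listing courses and staff directory",   # layer 1
    "national portal of universities and colleges by region",    # layer 2
    "general news site with an education section link",          # layer 2
]
layer_labels = [0, 1, 2, 2]

layer_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
layer_model.fit(layer_texts, layer_labels)

def crawl_priority(page_text):
    """Smaller predicted distance to a target page -> higher crawl priority."""
    predicted_layer = int(layer_model.predict([page_text])[0])
    return 1.0 / (1 + predicted_layer)

print(crawl_priority("faculty of education course listings and timetables"))
```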
Suel and Shkapenyuk (2002) described the architecture and implementation of an optimized distributed web crawler that runs on multiple workstations. Their crawler is crash resistant and capable of scaling up to hundreds of pages per second by increasing the number of participating nodes.
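One common way to scale a crawler across workstations, in the spirit of such distributed designs, is to partition URLs among nodes by hashing their hostnames so that each host is always handled by the same node; the sketch below is a generic illustration rather than Suel and Shkapenyuk's architecture.

```python
# Generic illustration of URL partitioning in a distributed crawler: each
# hostname hashes to one of N crawler nodes so per-host politeness and
# de-duplication stay local to a single node. Not the cited implementation.
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4  # assumed number of participating workstations

def node_for_url(url, num_nodes=NUM_NODES):
    """Assign a URL to a crawler node using a stable hash of its hostname."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

urls = [
    "http://example.com/a", "http://example.com/b",
    "http://example.org/x", "http://example.net/y",
]
for u in urls:
    print(node_for_url(u), u)   # URLs on the same host map to the same node
```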
The CROSSMARC approach was introduced by Karkaletsis, Stamatakis, Horlock, Grover and Curran (2003). CROSSMARC employs language techniques and machine learning for multi-lingual information extraction and consists of three main components: a site navigator that traverses web pages and forwards the collected information to page filtering and link scoring; page filtering, which filters the information based on the given queries; and link scoring, which sets the threshold likelihood of the crawled links.
Baeza-Yates (2005) highlighted that the crawlers in a search engine are responsible for generating the structured data, and that the retrieval process can be optimized using a focused web crawler for better search results. Castillo (2005) designed a new model for a web crawler, integrated with the search engine project WIRE, which provided access to the metadata that enables the web crawling process. He emphasized how to capture the most relevant pages, since there is an effectively infinite number of web pages on the internet with weak associations and relationships. He also stated that traversing only five layers from the home page is enough to obtain an overview snapshot of the corresponding web site, hence saving bandwidth and avoiding network congestion.
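Castillo's observation that five link levels from the home page usually suffice can be applied as a simple depth cut-off in a breadth-first, same-site crawl; the sketch below is a generic illustration of such a limit, with the seed URL and page budget as assumptions.

```python
# Illustrative depth-limited breadth-first crawl of a single site: stop
# expanding links more than MAX_DEPTH levels away from the home page
# (5, following the observation above). Seed and limits are assumptions.
from collections import deque
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

MAX_DEPTH = 5

def depth_limited_crawl(home_url, max_pages=50):
    home_host = urlparse(home_url).netloc
    queue, seen, pages = deque([(home_url, 0)]), {home_url}, []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        pages.append(url)
        if depth >= MAX_DEPTH:
            continue                      # do not expand beyond five levels
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = a["href"]
            if (link.startswith("http")
                    and urlparse(link).netloc == home_host
                    and link not in seen):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```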
Rungsawang and Angkawattanawit (2005) attempted to enhance the crawling process by involving knowledge bases to build up the experience of learnable focused web crawlers. They show results for an optimized focused web crawler that learns from the information collected in a knowledge base within one domain or category. They proposed three kinds of knowledge bases to help collect as many relevant web pages as possible and to recognize keywords related to the topic of interest.
Liu, Milios and Korba (2008) presented a framework for focused web crawling based on Maximum Entropy Markov Models (MEMMs) that enhanced the working mechanism of the crawler, placing it among the best best-first crawlers in web data mining according to two metrics, precision and maximum average similarity. Using MEMMs, they were able to exploit multiple overlapping and correlated features, including anchor text and the keywords embedded in the URL. Their experiments show that a focused web crawler using MEMMs with a combination of all features performs better than one using the Viterbi algorithm and depending only on a restricted number of features.
Batsakis, Petrakis and Milios (2009) evaluated various existing approaches to web crawling, such as Breadth-First, Best-First and Hidden Markov Model (HMM) crawlers. They proposed a focused web crawler based on an HMM that learns paths leading to relevant pages. They combined classic focused-crawler attributes with ideas from document clustering, resulting in an optimized analysis of relevant paths.
Liu and Milios (2010) extended their previous framework (Liu, Milios and Korba, 2008), proposing two probabilistic models for building a focused crawler, MEMMs and Linear-chain Conditional Random Fields (CRF), as shown in Figure 2. Their experiments show improvements in focused crawling and an advantage over the context-graph approach (Diligenti et al., 2000) and their previous model.
We provided an explanatory literature review only on the focused crawler because of its popularity in the research community. Various methods and algorithms have been embedded in focused crawlers to boost traversal performance and produce quality results, such as context graphs, statistical classifiers, machine learning techniques, information theory and entropy. Other techniques are used by the giant search engines to enhance their crawling-oriented services based on criteria such as locations and regions, the inserted search query, language, user browsing history and page ranks.
Figure 2: Focused Crawling using MEMM/CRF Models (Liu and Milios, Probabilistic Models for Focused Web Crawling, 2010)
4. Conclusion
Web crawling is an initial component of many applications, including search engines and opinion mining frameworks. We compared standard and focused web crawlers to understand which one is better, and we will apply it in our opinion mining framework in future work.
5. Acknowledgement
This work was supported by the project "Mining Opinions Using Combination of Word Sense and Bag of Words Approach for Educational Environments", funded by the Fundamental Research Grant Scheme, 2010–2012.
6. References
[1] Baeza-Yates, Ricardo. "Applications of Web Query Mining." Springer, 2005: 7-22.
[2] Batsakis, Sotiris, Euripides Petrakis, and Evangelos Milios. "Improving the performance of focused web crawlers."
Elsevier, 2009.
[3] Castillo, Carlos. "Effective Web Crawling." ACM, 2005.
[4] Chakrabarti, Soumen, Martin van den Berg, and Byron Dom. "Focused crawling: a new approach to topic-specific
Web resource discovery." Elsevier, 1999.
[5] Diligenti, Coetzee, Lawrence, Giles, and Gori. "Focused Crawling Using Context Graphs." 26th International
Conference on Very Large Databases, VLDB 2000. Cairo, Egypt, 2000. 527–534.
[6] Karkaletsis, Vangelis, Konstantinos Stamatakis, James Horlock, Claire Grover, and James R. Curran. "Domain-
Specific Web Site Identification: The CROSSMARC Focused Web Crawler." Proceedings of the 2nd International
Workshop on Web Document Analysis (WDA2003). Edinburgh, UK, 2003.
[7] Liu, Hongyu, and Evangelos Milios. "Probabilistic Models for Focused Web Crawling." Computational
Intelligence, 2010.
[8] Liu, Hongyu, Evangelos Milios, and Larry Korba. "Exploiting Multiple Features with MEMMs for Focused Web
Crawling." NRC, 2008.
[9] Rungsawang, Arnon, and Niran Angkawattanawit. "Learnable topic-specific web crawler." Science Direct, 2005:
97–114.
[10] Suel, Torsten, and Vladislav Shkapenyuk. "Design and Implementation of a High-Performance Distributed Web
Crawler." Proceedings of the IEEE International Conference on Data Engineering. 2002.