Focused Web Crawler
Ayoub Mohamed H. Elyasir1, Kalaiarasi Sonai Muthu Anbananthen2
Multimedia University, Melaka, Malaysia
1Email: ayoub_it@msn.com, 2Email: kalaiarasi@mmu.edu.my
Abstract. A web crawler is a program that traverses the internet in an automated manner to download web pages, or partial page contents, according to the user's requirements. Only a small amount of work on web crawlers has been published, for various reasons. Web crawling and its techniques remain in the shadow and hold many secrets because of their role in giant search engine applications, whose operators tend to obscure them as the secret recipe for their success. Secrecy also protects against search spamming and the gaming of ranking functions, so web crawling methods are rarely published or publicly announced. Yet the web crawler is an important and fragile component of many applications, including business competitive intelligence, advertising, marketing and internet usage statistics. In this work, we compare the two main types of web crawlers, standard and focused, in order to choose one of them and apply it in our later framework for opinion mining in the education domain.
Keywords: Web Crawling, Focused Crawler, Search Engine, Uniform Resource Locator, Canonicalization
1. Introduction
Over the last decade, the World Wide Web has evolved from a modest number of pages to billions of diverse objects. In order to harvest this enormous data repository, search engines download parts of the existing web and offer Internet users access to this database through keyword search. One of the main components of a search engine is the web crawler. A web crawler is a web service that assists users in their web navigation by automating the task of link traversal, creating a searchable index of the web, and fulfilling searchers' queries from the index. That is, a web crawler automatically discovers and collects resources in an orderly fashion from the internet according to the user's requirements. Researchers and programmers use different terms to refer to web crawlers, such as aggregators, agents and intelligent agents, spiders (by analogy with how spiders traverse their webs), or robots (because the crawler traverses the web in an automated manner).
To date, various applications of web crawlers have been introduced and developed to perform particular objectives. Some of these applications are malicious, in that they penetrate users' privacy by collecting information without permission. Nevertheless, web crawlers power applications with significant market impact, as they are central to search engines, business competitive intelligence and internet usage statistics. Unfortunately, web crawling remains in the shadow and holds many secrets because of its role in giant search engine applications, whose operators tend to obscure it as the secret recipe for their success. Secrecy also protects against search spamming and the gaming of ranking functions, so web crawling methods are rarely published or publicly announced.
2. Web Crawling
Searching is the most prominent activity on the web; internet users look into various topics and interests every time they surf it. Web crawling is the technical machinery behind internet search, which the giant search engines nowadays provide to users at no cost. No client-side components beyond the browser are needed; a web search service consists of two main logistical parts: crawling, the process of
finding documents and constructing the index; and serving, the process of receiving queries from searchers
and using the index to determine the relevant results.
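To make the crawling/serving split concrete, the following minimal sketch (our illustration, not any engine's implementation) builds a toy inverted index from fetched pages and answers keyword queries from it; the seed URL, page limit and helper names are assumptions introduced only for this example.

```python
# Minimal sketch of the crawling/serving split: crawl() fetches pages and
# builds an inverted index, serve() answers keyword queries from that index.
# Seed URL and limits are illustrative assumptions.
import re
from collections import defaultdict

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=10):
    """Crawling: find documents and construct the index."""
    index = defaultdict(set)              # term -> set of URLs containing it
    frontier, seen = list(seed_urls), set()
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)             # breadth-first order
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        for term in re.findall(r"[a-z]+", soup.get_text().lower()):
            index[term].add(url)
        frontier.extend(a["href"] for a in soup.find_all("a", href=True)
                        if a["href"].startswith("http"))
    return index

def serve(index, query):
    """Serving: use the index to determine the relevant results for a query."""
    results = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*results) if results else set()

if __name__ == "__main__":
    idx = crawl(["https://example.com"])  # illustrative seed
    print(serve(idx, "example domain"))
```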
Web crawling is the means by which a crawler collects pages from the Web. The result of crawling is a collection of Web pages at a central or distributed location. Given the continuous expansion of the Web, this crawled collection is guaranteed to be a subset of the Web and, indeed, may be far smaller than the total size of the Web. By design, a web crawler therefore aims for a small, manageable collection that is representative of the entire Web.
Web crawlers may differ from each other in the way they crawl web pages, which mainly depends on the final application that the crawling system will serve. Based on their functionality, crawlers are classified into standard and focused. A standard crawler collects web pages in a random fashion, while a focused crawler follows a guided traversal process. Figure 1 below shows that a standard crawler branches broadly through the nodes (web pages) regardless of the node's domain, while a focused crawler traverses deeper and more narrowly toward a specific node domain. Another remark on Figure 1 is that the starting node (root) is the same for both the standard and the focused crawler.
A focused crawler would ideally download only web pages that are relevant to a particular topic and avoid downloading all others. It predicts the probability that a link leads to a relevant page before actually downloading that page; a possible predictor is the anchor text of the link. In another approach, the relevance of a page is determined after downloading its content: relevant pages are sent to content indexing and the URLs they contain are added to the crawl frontier, while pages that fall below a relevance threshold are discarded. A minimal sketch of this loop is given after Figure 1.
Figure 1: Standard versus Focused Crawler
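The loop described above can be sketched as a best-first crawl over a priority frontier, where anchor text predicts a link's relevance before download and fetched pages below a threshold are discarded. The topic keywords, scoring functions and threshold below are illustrative assumptions, not the method of any cited work.

```python
# Hedged sketch of a focused crawler: a best-first frontier ordered by a simple
# anchor-text relevance predictor. Topic keywords, threshold and helper names
# are assumptions for illustration only.
import heapq

import requests
from bs4 import BeautifulSoup

TOPIC_KEYWORDS = {"education", "learning", "university", "course"}  # assumed topic
RELEVANCE_THRESHOLD = 0.25                                          # assumed cutoff

def predict_relevance(anchor_text):
    """Predict link relevance from anchor text before downloading the page."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_KEYWORDS) / max(len(words), 1)

def page_relevance(text):
    """Score a downloaded page by the fraction of topic keywords it mentions."""
    words = set(text.lower().split())
    return len(words & TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)

def focused_crawl(seed_url, max_pages=20):
    frontier = [(-1.0, seed_url)]          # max-heap via negated priority
    seen, collected = set(), []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        if page_relevance(soup.get_text()) < RELEVANCE_THRESHOLD:
            continue                        # discard off-topic page
        collected.append(url)               # "send to content indexing"
        for a in soup.find_all("a", href=True):
            link = a["href"]
            if link.startswith("http") and link not in seen:
                heapq.heappush(frontier, (-predict_relevance(a.get_text()), link))
    return collected
```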
2.1 Comparison
Table 1 below shows the difference between standard and focused web crawlers:
Table 1: Comparison between standard and focused web crawlers

| Criterion | Standard Web Crawler | Focused Web Crawler |
| --- | --- | --- |
| Synonym | No-selection web crawler | Topical web crawler (a topical crawler may refer to a focused crawler that does not use a classifier but a simple guided technique instead) |
| Introduced by | Various contributions | Chakrabarti, Berg and Dom (1999) |
| Definition | Traverses the internet in an automated, pre-defined manner, collecting web pages at random | Traverses the web in the same way as a standard crawler, but collects only pages similar to each other based on the domain, application, inserted query, etc. |
| Path searching | Random search; may lose its way while traversing the web | Narrowed search path with steady performance |
| Web pages | Not necessarily related or linked to each other | Must be related to particular criteria |
| Starting seed | Root seed | Root seed, with dependency on the web search engine to provide the starting point |
| Ending seed | Some random seed | Relevant to the traversed seeds |
| Robustness | Prone to URL distortions | Robust against distortions because it follows a relevant URL path |
| Discovery | Wide radius but less relevant web pages | Narrow radius with relevant web pages |
| Resource consumption | Lower resource consumption because of its basic path-traversal algorithms | High resource usage, especially with distributed focused crawlers that run on multiple workstations |
| Page weight | Assigns a value to each web page for priority purposes | Assigns a value to each web page for priority and relevance (credit) purposes |
| Performance dependency | Crawling is independent | Crawling depends on the link richness within a specific domain |
| Flexibility | Customizable with many options | Less flexible due to its domain dependency |
| Classifier | No classification involved; relies heavily on traditional graph algorithms such as depth-first or breadth-first traversal | Classifies pages as relevant or not relevant using Naïve Bayes, decision trees, breadth-first, neural networks or Support Vector Machines (SVM), the last of which outperforms the other methods, especially when applied to page contents and link context |
| Overall | Lower resource consumption and performance | Higher resource consumption and performance, with high-quality collections of web pages |
From the comparison, we find that the focused crawler is the better choice for traversing the internet. The ability to narrow the search radius along a specific, guided path gives the focused crawler higher-quality web page collections, since it attempts to identify the most related links and skips the off-topic ones. A malformed URL that sends the crawl in a false direction easily distorts a standard crawler, because it follows every link using a breadth-first algorithm and downloads them all along the way. Resource consumption is lower in standard crawling, but the focused crawler remains the better choice, given that computing resources are available today at reasonable prices. The focused crawler is not as customizable as the standard crawler, but it has the ability to classify results based on page contents and link context. Additionally, commercial applications prefer focused crawlers because of their domain dependency and restriction, where some crawl by topic and others crawl by region or location.
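As a hedged illustration of the classification ability mentioned above, the sketch below trains a Naïve Bayes model on a few labelled snippets that combine page content with link context (anchor text); the training examples and labels are invented purely for demonstration.

```python
# Illustrative sketch (not from the paper): classifying pages as relevant or
# not using Naïve Bayes over page content plus link context. The tiny training
# set below is invented for demonstration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each training example concatenates page text with the anchor text (link
# context) that led to it; labels: 1 = relevant to education, 0 = off-topic.
train_texts = [
    "university course catalogue lecture schedule | anchor: online courses",
    "student feedback on teaching quality and curriculum | anchor: education review",
    "buy cheap flights hotel deals travel packages | anchor: holiday offers",
    "celebrity gossip movie trailers box office | anchor: entertainment news",
]
train_labels = [1, 1, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

new_page = "distance learning degree programmes | anchor: university admissions"
print(classifier.predict([new_page]))        # likely [1] with this toy data
print(classifier.predict_proba([new_page]))  # class probabilities
```

The same pipeline could swap in a decision tree or an SVM; the point is only that both page content and link context feed the relevance decision.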
3. Review on Focused Web Crawler
Search engine web sites are the most visited sites on the internet worldwide because of their importance in our daily lives. The web crawler is the dominant function or module across the entire World Wide Web (WWW), as it is the heart of any search engine. The standard crawler is a powerful technique for traversing the web, but it is noisy in terms of resource usage on both client and server. Thus, most researchers focus on architectures and algorithms that are able to collect the pages most relevant to the corresponding topic of interest. The term focused crawling, which denotes the crawling of topic-specific web pages, was originally introduced by Chakrabarti, Berg and Dom (1999). In order to save hardware and network resources, a focused web crawler analyzes the crawled pages to find the links that are likely to be most relevant for the crawl and ignores the irrelevant clusters of the web.
Chakrabarti, Berg and Dom (1999) described a focused web crawler with three components: a classifier to evaluate a web page's relevance to the chosen topic, a distiller to identify relevant nodes within a few link layers, and a reconfigurable crawler that is governed by the classifier and distiller. They impose various features on the designed classifier and distiller: exploring links in terms of their sociology, extracting specific web pages for a given query, and mining communities (training) to improve the crawling ability, yielding high-quality collections with fewer irrelevant web pages.
The web page credit problem was addressed by Diligenti, Coetzee, Lawrence, Giles and Gori (2000), in which crawl paths are chosen based on the number of pages and their values. They use a context graph to capture the link hierarchies within which valuable pages occur and exploit reverse crawling capabilities for a more exhaustive search. They also concluded that focused crawling is the future replacement of standard crawling as long as large machine resources are available.
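The context-graph idea can be roughly sketched as follows: documents known to lie 0, 1, 2, ... links away from target pages (collected in practice by reverse crawling through a search engine's backlink facility) train a classifier that predicts the layer of a newly seen document, and the predicted distance sets its crawl priority. The toy layer data and priority rule below are assumptions for illustration, not the authors' implementation.

```python
# Rough, illustrative sketch of the context-graph idea: predict how many links
# away a page lies from a target page and prioritise closer-looking pages.
# The toy layer data and priority rule are assumptions, not the 2000 paper's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Layer 0 = target pages themselves, layer 1 = pages linking to targets,
# layer 2 = pages two links away (in practice gathered by reverse crawling).
layer_texts = [
    "course syllabus exam timetable lecturer office hours",      # layer 0
    "department homepage listing courses and staff directory",   # layer 1
    "national portal of universities and colleges by region",    # layer 2
    "general news site with an education section link",          # layer 2
]
layer_labels = [0, 1, 2, 2]

layer_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
layer_model.fit(layer_texts, layer_labels)

def crawl_priority(page_text):
    """Smaller predicted distance to a target page -> higher crawl priority."""
    predicted_layer = int(layer_model.predict([page_text])[0])
    return 1.0 / (1 + predicted_layer)

print(crawl_priority("faculty of education course listings and timetables"))
```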
Suel and Shkapenyuk (2002) described the architecture and implementation of an optimized distributed web crawler that runs on multiple workstations. Their crawler is crash resistant and capable of scaling up to hundreds of pages per second by increasing the number of participating nodes.
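One common way to scale a crawler across workstations, in the spirit of such distributed designs, is to partition URLs among nodes by hashing their hostnames so that each host is always handled by the same node; the sketch below is a generic illustration rather than Suel and Shkapenyuk's architecture.

```python
# Generic illustration of URL partitioning in a distributed crawler: each
# hostname hashes to one of N crawler nodes so per-host politeness and
# de-duplication stay local to a single node. Not the cited implementation.
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4  # assumed number of participating workstations

def node_for_url(url, num_nodes=NUM_NODES):
    """Assign a URL to a crawler node using a stable hash of its hostname."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

urls = [
    "http://example.com/a", "http://example.com/b",
    "http://example.org/x", "http://example.net/y",
]
for u in urls:
    print(node_for_url(u), u)   # URLs on the same host map to the same node
```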
The CROSSMARC approach was introduced by Karkaletsis, Stamatakis, Horlock, Grover and Curran (2003). CROSSMARC employs language techniques and machine learning for multi-lingual information extraction and consists of three main components: a site navigator that traverses web pages and forwards the collected information to page filtering and link scoring; page filtering, which filters the information based on the given queries; and link scoring, which sets the threshold likelihood of the crawled links.
Baeza-Yates (2005) highlighted that the crawlers in a search engine are responsible for generating the structured data, and that the retrieval process can be optimized using a focused web crawler for better search results. Castillo (2005) designed a new model for a web crawler, integrated with the search engine project WIRE, which provided access to the metadata that enables the web crawling process. He emphasized how to capture the most relevant pages, since there is an effectively infinite number of web pages on the internet with weak associations and relationships. He also stated that traversing only five layers from the home page is enough to obtain an overview snapshot of the corresponding web site, hence saving bandwidth and avoiding network congestion.
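Castillo's observation that five link levels from the home page usually suffice can be applied as a simple depth cut-off in a breadth-first, same-site crawl; the sketch below is a generic illustration of such a limit, with the seed URL and page budget as assumptions.

```python
# Illustrative depth-limited breadth-first crawl of a single site: stop
# expanding links more than MAX_DEPTH levels away from the home page
# (5, following the observation above). Seed and limits are assumptions.
from collections import deque
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

MAX_DEPTH = 5

def depth_limited_crawl(home_url, max_pages=50):
    home_host = urlparse(home_url).netloc
    queue, seen, pages = deque([(home_url, 0)]), {home_url}, []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        pages.append(url)
        if depth >= MAX_DEPTH:
            continue                      # do not expand beyond five levels
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = a["href"]
            if (link.startswith("http")
                    and urlparse(link).netloc == home_host
                    and link not in seen):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```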
Rungsawang and Angkawattanawit (2005) attempted to enhance the crawling process by involving knowledge bases to build up the experience of learnable focused web crawlers. They show results for an optimized focused web crawler that learns from the information collected in a knowledge base within one domain or category. They proposed three kinds of knowledge bases to help collect as many relevant web pages as possible and to recognize keywords related to the topic of interest.
Liu, Milios and Korba (2008) presented a framework for focused web crawling based on Maximum Entropy Markov Models (MEMMs) that enhanced the working mechanism of the crawler, placing it among the best best-first crawlers in web data mining according to two metrics, precision and maximum average similarity. Using MEMMs, they were able to exploit multiple overlapping and correlated features, including anchor text and the keywords embedded in the URL. Their experiments show that a focused web crawler using MEMMs with a combination of all features performs better than one using the Viterbi algorithm and depending only on a restricted number of features.
Batsakis, Petrakis and Milios (2009) evaluated various existing approaches to web crawling, such as Breadth-First, Best-First and Hidden Markov Model (HMM) crawlers. They proposed a focused web crawler based on an HMM that learns paths leading to relevant pages. They combined classic focused-crawler attributes with ideas from document clustering, resulting in an optimized analysis of relevant paths.
Liu and Milios (2010) extended their previous framework (Liu, Milios and Korba, 2008), proposing two probabilistic models for building a focused crawler, MEMMs and Linear-chain Conditional Random Fields (CRF), as shown in Figure 2. Their experiments show improvements in focused crawling and an advantage over the context-graph approach (Diligenti et al., 2000) and their previous model.
We provided an explanatory literature review only on the focused crawler because of its popularity in the research community. Various methods and algorithms have been embedded in focused crawlers to boost traversal performance and produce quality results, such as context graphs, statistical classifiers, machine learning techniques, information theory and entropy. Other techniques are used by the giant search engines to enhance their crawling-oriented services based on criteria such as locations and regions, the inserted search query, language, user browsing history and page ranks.
Figure 2: Focused Crawling using MEMM/CRF Models (Liu and Milios, Probabilistic Models for Focused Web Crawling, 2010)
4. Conclusion
Web crawling is an initial component of many applications, including search engines and opinion mining frameworks. We compared standard and focused web crawlers to understand which one is better, and we will apply it in our opinion mining framework in future work.
5. Acknowledgement
This work was supported by the project "Mining Opinions Using Combination of Word Sense and Bag of Words Approach for Educational Environments", funded by the Fundamental Research Grant Scheme, 2010–2012.
6. References
[1] Baeza-Yates, Ricardo. "Applications of Web Query Mining." Springer, 2005: 7-22.
[2] Batsakis, Sotiris, Euripides Petrakis, and Evangelos Milios. "Improving the performance of focused web crawlers."
Elsevier, 2009.
[3] Castillo, Carlos. "Effective Web Crawling." ACM, 2005.
[4] Chakrabarti, Soumen, Martin van den Berg, and Byron Dom. "Focused crawling: a new approach to topic-specific
Web resource discovery." Elsevier, 1999.
[5] Diligenti, Coetzee, Lawrence, Giles, and Gori. "Focused Crawling Using Context Graphs." 26th International
Conference on Very Large Databases, VLDB 2000. Cairo, Egypt, 2000. 527–534.
[6] Karkaletsis, Vangelis, Konstantinos Stamatakis, James Horlock, Claire Grover, and James R. Curran. "Domain-
Specific Web Site Identification: The CROSSMARC Focused Web Crawler." Proceedings of the 2nd International
Workshop on Web Document Analysis (WDA2003). Edinburgh, UK, 2003.
[7] Liu, Hongyu, and Evangelos Milios. "Probabilistic Models for Focused Web Crawling." Computational
Intelligence, 2010.
[8] Liu, Hongyu, Evangelos Milios, and Larry Korba. "Exploiting Multiple Features with MEMMs for Focused Web
Crawling." NRC, 2008.
[9] Rungsawang, Arnon, and Niran Angkawattanawit. "Learnable topic-specific web crawler." Science Direct, 2005:
97–114.
[10] Suel, Torsten, and Vladislav Shkapenyuk. "Design and Implementation of a High-Performance Distributed Web
Crawler." Proceedings of the IEEE International Conference on Data Engineering. 2002.