Architecture of a WebCrawler
N. K. Kameswara Rao a, Dr. G. P. Saradhi Varma b
a Professor, Department of Computer Science, Pragati Engineering College, India, nkkamesh@gmail.com
b Professor& HOD, Department of IT, SRKR Engineering College, India, gpsvarma@yahoo.com
Abstract
WebCrawler is the comprehensive full-text search engine for the World-Wide Web. Its invention and subsequent evolution helped the Web's growth by creating a new way of navigating hypertext. WebCrawler assists users in their Web navigation by automating the task of link traversal, creating a searchable index of the Web and fulfilling searchers' queries from the index. The Web is a context in which traditional Information Retrieval methods are challenged, given the volume of the Web and its speed of change. Server log files provide domain types, time of access, keywords and the search engine used by visitors, and can provide some insight into how a visitor arrived at a website and what keywords they used to locate it. Data mining solutions come in many types, such as association, segmentation, clustering, classification (prediction), visualization and optimization. Using a data mining tool incorporating a machine-learning algorithm, a website database can be segmented into unique groups of visitors, each with individual behavior. Based on research studies and feedback provided by the data processing group, we gather a set of data that should be stored and some possible queries that other researchers may be interested in.
Introduction
Web crawlers are programs that exploit the graph
structure of the Web to move from page to page. In their
infancy such programs were also called wanderers, robots,
spiders, fish, and worms, words that are quite evocative of
Web imagery. It may be observed that the noun 'crawler' is not
indicative of the speed of these programs, as they can be
considerably fast.
From the beginning, a key motivation for designing
Web crawlers has been to retrieve Web pages and add them or
their representations to a local repository. Such a repository
may then serve particular application needs such as those of a
Web search engine. In its simplest form a crawler starts from a
seed page and then uses the external links within it to visit
other pages. The process repeats with the new pages offering
more external links to follow, until a sufficient number of
pages are identified or some higher level objective is reached.
Behind this simple description lies a host of issues related to
network connections, spider traps, canonicalizing URLs,
parsing HTML pages, and the ethics of dealing with remote
Web servers.
Were the Web a static collection of pages we would
have little long term use for crawling. Once all the pages had
been fetched to a repository (like a search engine's database),
there would be no further need for crawling. However, the
Web is a dynamic entity with subspaces evolving at differing
and often rapid rates. Hence there is a continual need for
crawlers to help applications stay current as new pages are
added and old ones are deleted, moved or modified.
General purpose search engines serving as entry
points to Web pages strive for coverage that is as broad as
possible. They use Web crawlers to maintain their index
databases amortizing the cost of crawling and indexing over
the millions of queries received by them. These crawlers are
blind and exhaustive in their approach, with
comprehensiveness as their major goal. In contrast, crawlers
can be selective about the pages they fetch and are then
referred to as preferential or heuristic-based crawlers. These
may be used for building focused repositories, automating
resource discovery, and facilitating software agents.
Preferential crawlers built to retrieve pages within a certain
topic are called topical or focused crawlers.
There are several dimensions about topical crawlers
that make them an exciting object of study. One key question
that has motivated much research is: How is crawler selectivity
to be achieved? Rich contextual aspects such as the goals of
the parent application, lexical signals within the Web pages
and also features of the graph built from pages already seen -
these are all reasonable kinds of evidence to exploit.
Additionally, crawlers can and often do differ in their
mechanisms for using the evidence available to them.
A second major aspect that is important to consider
when studying crawlers, especially topical crawlers, is the
nature of the crawl task. Crawl characteristics such as queries
and/or keywords provided as input criteria to the crawler, user-
profiles, and desired properties of the pages to be fetched
(similar pages, popular pages, authoritative pages etc.) can lead
to significant differences in crawler design and
implementation. The task could be constrained by parameters
like the maximum number of pages to be fetched (long crawls
vs. short crawls) or the available memory. Hence, a crawling
task can be viewed as a constrained multi-objective search
problem. However, the wide variety of objective functions,
coupled with the lack of appropriate knowledge about the
search space, makes the problem a hard one. Furthermore, a
crawler may have to deal with optimization issues such as local
vs. global optima.
The last key dimension is regarding crawler
evaluation strategies necessary to make comparisons and
determine circumstances under which one or the other crawlers
work best. Comparisons must be fair and made with an eye
towards drawing out statistically significant differences. Not
only does this require a sufficient number of crawl runs but
also sound methodologies that consider the temporal nature of
crawler outputs. Significant challenges in evaluation include
the general unavailability of relevant sets for particular topics
or queries. Thus evaluation typically relies on defining
measures for estimating page importance.
Web crawling issues
There are two important characteristics of the Web
that generate a scenario in which Web crawling is very
difficult: its large volume and its rate of change, as there is a
huge amount of pages being added, changed and removed
every day. Also, network speed has improved less than current
processing speeds and storage capacities. The large volume
implies that the crawler can only download a fraction of the
Web pages within a given time. The high rate of change
implies that by the time the crawler is downloading the last
pages from a site, it is very likely that new pages have been
added to the site, or that pages have already been updated or even deleted.
Crawling the Web in a certain way resembles watching the sky on a clear night: what we see reflects the state of the stars at different times, as their light has travelled different distances. What a Web crawler obtains is not a "snapshot" of the Web, because it does not represent the Web at any given instant of time. The last pages being crawled are probably very accurately represented, but the first pages that were downloaded have a high probability of having been changed. This idea is depicted in Figure 1. As Edwards et al. note, "Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained." A crawler must therefore carefully choose at each step which pages to visit next.
Figure 1: Search Engine View
As Figure 1 illustrates, because the crawling process takes time and the Web is very dynamic, a search engine's view of the Web represents the state of Web pages at different times. This is similar to watching the sky at night: the stars we see never existed simultaneously as we see them.
The behavior of a Web crawler is the outcome of a
combination of policies:
• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the
pages.
• A politeness policy that states how to avoid overloading Web
sites.
• A parallelization policy that states how to coordinate distributed Web crawlers.
1. Selection policy
Given the current size of the Web, even large search
engines cover only a portion of the publicly available content;
a study by Lawrence and Giles showed that no search engine
indexes more than 16% of the Web. As a crawler always
downloads just a fraction of the Web pages, it is highly
desirable that the downloaded fraction contains the most
relevant pages, and not just a random sample of the Web. This
requires a metric of importance for prioritizing Web pages.
The importance of a page is a function of its intrinsic quality,
its popularity in terms of links or visits, and even of its URL.
Designing a good selection policy has an added difficulty: it
must work with partial information, as the complete set of Web
pages is not known during crawling.
2. Re-visit policy
The Web has a very dynamic nature, and crawling a
fraction of the Web can take a long time, usually measured in
weeks or months. By the time a Web crawler has finished its
crawl, many events could have happened. We characterize these events as creations, updates and deletions.
Creations. When a page is created, it will not be visible on the public Web space until it is linked, so we assume that at least one page update (adding a link to the new Web page) must occur for a Web page creation to become visible. A Web crawler starts with a set of starting URLs, usually a list of domain names, so registering a domain name can be seen as the act of creating a URL. Also, under some schemes of cooperation the Web server could provide a list of URLs without the need of a link.
Updates. Page changes are difficult to characterize: an update can be either minor or major. An update is minor if it is at the paragraph or sentence level, so the page is semantically almost the same and references to its content are still valid. On the contrary, in the case of a major update, all references to its content are no longer valid. It is customary to consider any update as major, as it is difficult to judge automatically whether the page's content is semantically the same.
Deletions. A page is deleted if it is removed from the public Web, or if all the links to that page are removed. Note that even if all the links to a page are removed, so that the page is no longer visible in the Web site, it may still be visible to the Web crawler. It is almost impossible to detect that a page has lost all of its links, as the Web crawler can never tell whether links to the target page are absent or only present in pages that have not been crawled.
Cost functions. From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most used cost functions are freshness and age.
Freshness. This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:
F_p(t) = 1 if p is equal to the local copy at time t, and 0 otherwise.
Age. This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:
A_p(t) = 0 if p is not modified at time t, and t − (modification time of p) otherwise.
The evolution of these two quantities is depicted in Figure 2.
Figure-2 The Evolution of Freshness and Age
Figure 2 shows the evolution of freshness and age over time. Two types of events can occur: modification of a Web page on the server (event "modify") and downloading of the modified page by the crawler (event "sync").
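To make these two cost functions concrete, the following is a minimal Python sketch. The PageCopy record and its fields (the last known modification time and a flag marking whether the local copy still matches the server) are illustrative assumptions rather than part of the original formulation.

```python
from dataclasses import dataclass

@dataclass
class PageCopy:
    """Local copy of a page p, with the bookkeeping needed for the cost functions."""
    last_modified: float         # last known modification time of p on the server
    identical_to_server: bool    # does the local copy still match the server copy?

def freshness(p: PageCopy) -> int:
    """F_p(t): 1 if p is equal to the local copy at time t, 0 otherwise."""
    return 1 if p.identical_to_server else 0

def age(p: PageCopy, t: float) -> float:
    """A_p(t): 0 while p is unmodified, otherwise the time elapsed since the
    (unsynchronized) modification of p."""
    return 0.0 if p.identical_to_server else t - p.last_modified
```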
Explicit formulas for the re-visit policy are not attainable in general; in practice they are obtained numerically, as they depend on the distribution of page changes. Note that the re-visiting policies considered here regard all pages as homogeneous in terms of quality (all pages on the Web are worth the same), which is not a realistic scenario, so further information about Web page quality should be included to achieve a better crawling policy.
3. Politeness policy
As noted by Koster, the use of Web robots is useful
for a number of tasks, but comes with a price for the general
community. The costs of using Web robots include:
• Network resources, as robots require considerable bandwidth,
and operate with a high degree of parallelism during a long
period of time.
• Server overload, especially if the frequency of accesses to a given server is too high.
• Poorly written robots, which can crash servers or routers, or
which download pages they cannot handle.
• Personal robots that, if deployed by too many users, can
disrupt networks and Web servers.
A partial solution to these problems is the robots exclusion protocol, a standard that lets administrators indicate which parts of their Web servers should not be accessed by robots. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload.
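As a small illustration of this check, the sketch below uses Python's standard urllib.robotparser module; the user-agent name "ExampleCrawler" is a placeholder, and a real crawler would cache the parsed robots.txt per host instead of re-reading it for every URL.

```python
from urllib import robotparser
from urllib.parse import urlparse, urlunparse

def allowed_to_fetch(url: str, user_agent: str = "ExampleCrawler") -> bool:
    """Consult the robots exclusion protocol before fetching a URL."""
    parts = urlparse(url)
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # downloads and parses robots.txt
    return rp.can_fetch(user_agent, url)
```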
It is worth noticing that even when being very polite,
and taking all the safeguards to avoid overloading Web
servers, some complaints from Web server administrators are
received. Brin and Page note that: “... running a crawler which
connects to more than half a million servers (...) generates a
fair amount of email and phone calls. Because of the vast
number of people coming on line, there are always those who
do not know what a crawler is, because this is the first one they
have seen.”
4. Parallelization policy
A parallel crawler is a crawler that runs multiple
processes in parallel. The goal is to maximize the download
rate while minimizing the overhead from parallelization and to
avoid repeated downloads of the same page.
To avoid downloading the same page more than once, the
crawling system requires a policy for assigning the new URLs
discovered during the crawling process, as the same URL can
be found by two different crawling processes. Cho and Garcia-
Molina studied two types of policy:
Dynamic assignment. With this type of policy, a
central server assigns new URLs to different crawlers
dynamically. This allows the central server to, for instance,
dynamically balance the load of each crawler.
With dynamic assignment, typically the systems can
also add or remove downloader processes. The central server
may become the bottleneck, so most of the workload must be
transferred to the distributed crawling processes for large
crawls.
There are two configurations of crawling architectures with
dynamic assignment that have been described by Shkapenyuk
and Suel:
• A small crawler configuration, in which there is a central
DNS resolver and central queues per Web site, and distributed
downloaders.
• A large crawler configuration, in which the DNS resolver and
the queues are also distributed.
Static assignment. With this type of policy, there is a fixed rule, stated from the beginning of the crawl, that defines how to assign new URLs to the crawlers.
For static assignment, a hashing function can be used
to transform URLs (or, even better, complete Web site names)
into a number that corresponds to the index of the
corresponding crawling process. As there are external links
that will go from a Web site assigned to one crawling process
to a Web site assigned to a different crawling process, some
exchange of URLs must occur.
To reduce the overhead due to the exchange of URLs
between crawling processes, the exchange should be done in
batch, several URLs at a time, and the most cited URLs in the
collection should be known by all crawling processes before
the crawl (e.g., using data from a previous crawl).
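A minimal sketch of static assignment, assuming MD5 as the hashing function and hashing the complete Web site name (the host) rather than the full URL, so that all pages of a site stay with one crawling process; the function name is illustrative.

```python
import hashlib
from urllib.parse import urlparse

def assign_crawler(url: str, num_processes: int) -> int:
    """Static assignment: map a URL to the index of a crawling process by
    hashing its host name. Keeping a whole site in one process also lets that
    process enforce politeness constraints locally."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processes
```

URLs discovered by one process but assigned to another would then be buffered and exchanged in batches, as described above.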
Architecture of a Web Crawler
Figure 3 depicts the typical architecture of a large-
scale Web crawler. By a large-scale crawler we mean a system
capable of gathering billions of documents from the current
World Wide Web. It is clear that with such a huge amount of
data more sophisticated techniques must be applied than
simply parsing HTML files and downloading documents from
the URLs extracted from them. As we can see in the figure, much attention is paid to the problems of avoiding Web pages (URLs) that have already been visited, parallelizing the crawl (fetching threads), balancing the load of the Web servers from which documents are obtained (server queues), and speeding up access to Web servers (via DNS caching).
Figure 3: Architecture of a typical Web crawler
The role of Web crawling
Although Web crawling is at the heart of every Web search engine, only rather general architectural descriptions of crawlers, lacking important details, have been published so far. Commercial search engines treat their Web crawling techniques as business secrets and prefer not to give their rivals a chance to take advantage of their know-how. Another reason is to keep essential information on crawling away from search engine spammers who would abuse it. Among the crawler architectures that have been published are that of Alexa, which is still the Web robot of the Internet Archive; an early version of Googlebot, the crawler of Google; Mercator, which was the spider of AltaVista; UbiCrawler; and Dominos.
In general, a Web crawler takes a URL from the queue of pending URLs, downloads the page at that URL, stores the document in a repository, and parses its text to find hyperlinks to further URLs, which it then enqueues in the queue of pending URLs if they have not yet been downloaded (“fetched”). Ideally, crawling is stopped when the
queue of pending URLs is empty. In practice, however, this
will never happen as the universe of a large-scale Web crawler
is almost infinite. The Web is steadily changing and will never
be crawled as a whole. So a reasonable terminating condition
must be set up for the crawler to stop. For example, a certain
number of documents have been fetched, a specific number of
terabytes of data has been downloaded, a particular time period
has elapsed, or the crawler simply runs out of resources (main
memory, storage capacities, etc.).
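The following is a minimal, purely sequential Python sketch of this loop, using only the standard library. The seed list, the max_pages terminating condition and the in-memory repository are illustrative simplifications of what a large-scale crawler would actually use.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href attributes of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    """Fetch, store, parse, and enqueue unseen URLs until a bound is reached."""
    frontier = deque(seeds)             # queue of pending URLs
    visited = set(seeds)
    repository = {}                     # url -> document
    while frontier and len(repository) < max_pages:   # terminating condition
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                    # skip unreachable or non-decodable pages
        repository[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in visited:
                visited.add(absolute)
                frontier.append(absolute)
    return repository
```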
Internals
More specifically, a Web spider would like to do
many activities in parallel in order to speed up the process of
crawling. In fact, DNS name resolution, i.e. getting the IP address of an Internet host by contacting specific servers with name-to-IP mappings, and opening an HTTP connection to a Web server may take up to a second, which is often more than receiving the response from a Web server (i.e. downloading a
small or middle-sized document with a sufficiently fast
connection). So the natural idea is to fetch many documents at
a time.
Current commercial large-scale Web robots fetch up
to several thousands of documents in parallel and crawl the
“whole” Web (billions of documents) within a couple of
weeks. Interestingly, parallelization objects offered by
operating systems such as processes and threads do not seem
advantageous for multiple fetching of thousands of documents
due to thread (process) synchronization overheads. Instead, a
non-blocking fetching via asynchronous sockets is preferred.
Indeed, present commercial search engines work with such
huge amounts of data that they have to use technologies that
are often beyond capabilities of traditional operating systems.
Google, for example, has a file system of its own.
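A minimal sketch of fetching many documents at a time without one thread per connection. It uses Python's asyncio event loop together with the third-party aiohttp client for brevity, whereas the crawlers described above work directly with non-blocking sockets; the concurrency limit and time-out values are illustrative.

```python
import asyncio
import aiohttp  # third-party asynchronous HTTP client, used here for brevity

async def fetch(session, url):
    """Download one page without blocking the event loop."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return url, await resp.text()
    except Exception:
        return url, None

async def fetch_many(urls, concurrency=100):
    """Keep up to `concurrency` requests in flight at the same time."""
    semaphore = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        async def guarded(url):
            async with semaphore:
                return await fetch(session, url)
        return await asyncio.gather(*(guarded(u) for u in urls))

# results = asyncio.run(fetch_many(["http://example.com/"]))
```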
Implementors of large-scale Web crawlers try to
reduce the host name resolution time by means of DNS
caching. The DNS server mapping host names to their IP
addresses is customized and extended with a DNS cache and a
prefetching client. The cache is preferably placed in the main
memory for a very fast lookup in the table of names and IPs. In
this way, server names that have already been put in the cache
before can be found almost immediately.
New names, though, have still to be searched for on
distant DNS servers. Therefore, the prefetching client sends
requests to the DNS server right after URL extraction from a
downloaded page and does not wait until the resolution
terminates (non-blocking UDP datagrams are sent). Thus, the
cache is filled up with corresponding IPs long before they are
actually needed. (DNS requests are kept completely away from
a common Web surfer. It is the Web browser that gets all the
work done.)
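A toy sketch of the idea, assuming the operating system's resolver and a thread pool in place of a customized DNS server. prefetch() would be called right after URL extraction so that the cache is usually warm by the time the fetcher needs the address; all names are illustrative.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

class DnsCache:
    """In-memory host-name-to-IP cache with non-blocking prefetching."""

    def __init__(self, workers: int = 16):
        self._cache = {}                                   # host -> IP address
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def prefetch(self, host: str) -> None:
        """Issue the lookup right after URL extraction, without waiting for it."""
        if host not in self._cache:
            self._pool.submit(self._resolve, host)

    def _resolve(self, host: str) -> None:
        try:
            self._cache[host] = socket.gethostbyname(host)
        except OSError:
            pass                                           # leave unresolved hosts out

    def lookup(self, host: str) -> str:
        """Return the cached IP, resolving synchronously only on a cache miss."""
        if host not in self._cache:
            self._resolve(host)
        return self._cache.get(host, host)
```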
Avoiding redundancy
The biggest task of a crawler is to avoid redundancy by
eliminating duplicate pages and links from the crawl. A
crawler that does not respect this may easily end up in a spider trap: an infinite loop of links between the same pages. Such a
trapped spider can “crawl” the Web for ages and collect
petabytes of data, but it will be useless, because it gets stuck in
just one place of the Web. There must be a module
(isUrlVisited?) that checks whether or not a page has been
already fetched before putting its URL to the working pool of
pending documents (sometimes called frontier). The intuitive
solution is to have a list of URLs already visited and to
compare each newly extracted URL against this list.
Unfortunately, many problems arise here:
Different forms of URLs. URLs occur in various forms. They may be absolute or relative, they may or may not include port numbers, fragments, or queries that may contain special or even non-Latin characters, they may be in lower case or upper case, etc. Before we can attempt to compare URLs, we have to normalize them and produce the so-called canonical form. In this form, every URL is absolute, with the host name in lower case, without non-Latin characters and so on.
Too many URLs. To crawl a significant portion of the
Web, we would need to store somewhere a few billion URLs for further comparisons. Imagine that an average normalized URL is fifty characters long. Even for a one-billion-page crawl, a storage capacity of 50 billion bytes (50 GB) would be required. Moreover, access to the list of URLs visited must be very fast, as the check will be very frequent. How can we resolve this difficulty? We can somewhat reduce the size of URLs by encoding them into MD5 fingerprints or CRC checksums. These fingerprints may be four to eight bytes long, depending on how many URLs we expect to crawl. In addition, we can use each fingerprint as a hash and store the URLs in a hash table on disk. Disk seeks will still be slow, but we can improve this with two-level hashing: host name hashing and path hashing are done separately for each URL.
Duplicate pages with different URLs. Even if we are
careful enough and never crawl the same URL twice, we
can still download pages with the same content if they
have different URLs. In order to avoid adding links to the frontier that appear new, because they are relative to a page with a different URL but duplicate content, yet in reality have been added before, it is necessary to check for each newly fetched page whether its content has already been downloaded (the isPageKnown? module in Figure 3).
Again, we can use the MD5 hash function here. We will maintain a list of fingerprints of fetched pages' contents and compare each new page against it. Unfortunately, even a very small difference between two pages that are otherwise considered duplicates, such as a different time stamp at the bottom of the page, results in distinct fingerprints, and duplicate recognition fails. Thus, the process must be enhanced with a technique called shingling, which detects near-duplicates. A small sketch of these checks follows.
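The sketch below illustrates the three checks discussed above: URL canonicalization, MD5 fingerprinting of URLs, and shingling-based near-duplicate detection. The normalization rules, the word-level shingles and the Jaccard-similarity threshold are illustrative simplifications.

```python
import hashlib
from urllib.parse import urljoin, urlparse, urlunparse

def canonicalize(url: str, base: str = "") -> str:
    """Produce a canonical URL: absolute, host name in lower case, default HTTP
    port and fragment removed. (A production crawler normalizes much more.)"""
    parts = urlparse(urljoin(base, url))
    host = parts.netloc.lower()
    if parts.scheme == "http" and host.endswith(":80"):
        host = host[:-3]
    return urlunparse((parts.scheme, host, parts.path or "/",
                       parts.params, parts.query, ""))     # drop the fragment

def url_fingerprint(url: str) -> bytes:
    """Eight-byte MD5 fingerprint of a canonical URL (isUrlVisited? check)."""
    return hashlib.md5(url.encode("utf-8")).digest()[:8]

def shingles(text: str, k: int = 4) -> set:
    """Hashed k-word shingles of a page's text (isPageKnown? check)."""
    words = text.split()
    return {hash(tuple(words[i:i + k])) for i in range(len(words) - k + 1)}

def near_duplicates(a: str, b: str, threshold: float = 0.9) -> bool:
    """Pages are near-duplicates if the Jaccard similarity of their shingle sets
    exceeds the threshold, so a changed time stamp no longer defeats detection."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return a == b
    return len(sa & sb) / len(sa | sb) >= threshold
```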
Care must be taken not to overload Web servers with requests. Not only does this prevent a denial of service, but it is also a measure of politeness towards other Web users. Ideally, the load monitor and manager distribute requests evenly among servers, for each of which there is a queue of pending URLs. They ensure that the interval between two requests sent to the same server is no less than, say, a minute. Among other benefits, fetching pages uniformly from distinct servers reduces the risk of getting stuck in a spider trap.
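A minimal sketch of such per-server scheduling; the one-minute minimum interval follows the example above, and the class and method names are illustrative.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteScheduler:
    """One queue of pending URLs per server; a URL is handed out only when the
    minimum interval since the last request to that server has elapsed."""

    def __init__(self, min_interval: float = 60.0):
        self.min_interval = min_interval
        self.queues = defaultdict(deque)    # host -> pending URLs
        self.last_request = {}              # host -> time of last request

    def add(self, url: str) -> None:
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        """Return a URL whose server may be contacted now, or None."""
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now - self.last_request.get(host, 0.0) >= self.min_interval:
                self.last_request[host] = now
                return queue.popleft()
        return None
```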
Dynamic pages
We have seen that the greatest danger for a Web
crawler consists in not recognizing that a Web page has
already been fetched before. If this happens, the spider may
easily crawl a very small part of the Web infinitely long. The
main sources of such difficulties are page duplication and site
mirroring (i.e. duplication of whole Web sites), dynamically
generated pages and Web host aliases. A computer with a
certain IP address may be represented by one or more host
names (virtual servers). On the other hand, a Web site may be
hosted by several machines with distinct IPs. This many-to-
many relation between host names and IPs along with aliases
(synonymous names of a Web site) makes the recognition of
known URLs even more difficult. Besides shingling for
duplicate pages, there exist techniques for the detection of
mirrored Web sites that may help resolve this problem as well.
But by far the biggest trouble is with dynamic pages, such as those generated by CGI, PHP, or JavaScript.
Dynamic pages are dangerous in that they can
generate whatever content (including what we are not at all
interested in), and that their number may be virtually infinitely
large. Dynamic pages often contain generated URLs that differ
only in one parameter of their query part. Also, they are often
results of a database query depending on what the Web user
types in a Web form, etc. It is not feasible to store fingerprints of either their URLs or their contents because of their immense number. How can we overcome this problem? The most robust spider would simply ignore dynamic pages. However, it would probably miss a lot of important data. There have even been attempts to crawl the hidden Web behind Web forms. In practice, we must still observe crawling statistics and set bounds for various parameters, such as the number of documents gathered on a site or the crawling depth (i.e. the number of links followed to reach the current page). Whenever a bound is exceeded, crawling as a whole, or just on that particular site, is stopped. For example, Baeza-Yates recommends a maximum crawling depth of five for static pages and fifteen for dynamic pages.
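A tiny sketch of such bound checking before a URL is added to the frontier. The per-site page limit and the heuristic used to classify a URL as dynamic are illustrative assumptions; the depth limits follow the figures quoted above.

```python
from urllib.parse import urlparse

MAX_DEPTH = {"static": 5, "dynamic": 15}   # depth bounds quoted above
MAX_PAGES_PER_SITE = 10_000                # illustrative per-site document bound

def should_enqueue(url: str, depth: int, pages_per_site: dict) -> bool:
    """Reject URLs that exceed the crawling-depth bound for their page type or
    whose site has already yielded too many documents."""
    dynamic = "?" in url or url.endswith((".php", ".cgi"))   # crude heuristic
    if depth > MAX_DEPTH["dynamic" if dynamic else "static"]:
        return False
    host = urlparse(url).netloc
    return pages_per_site.get(host, 0) < MAX_PAGES_PER_SITE
```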
Crawling strategies
Assume for simplicity that we are to crawl a small
part of the Web that is a tree. Because we are sure that this part
of the Web is finite and that we are going to visit all of its
pages, we can arbitrarily choose one of the two basic crawling methods: breadth-first or depth-first crawling. Let us recall
that with breadth-first crawling, we first visit nodes with the
same distance (number of links) from the root node. The data
structure used here to store links extracted from pages is a
queue. On the other hand, in depth-first crawling, we follow
links as
deep as we can. We put them on a stack. See Figure 4 for a
small example. Which of the two strategies is better? In this
simple case, they are the same provided we are not interested
in the order of visiting individual pages. At the end, we will
have a set of documents which we can, for example, add to a
corpus and build an index on it.
Figure 4: Breadth-first (left) and depth-first (right) crawling of a
simple Web tree.
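Both strategies can be written as one function that differs only in which end of the link container the next URL is taken from; get_links is an assumed helper that fetches a page and returns its out-links, and the deque acts as a FIFO queue for breadth-first and as a stack for depth-first crawling.

```python
from collections import deque

def traverse(seed, get_links, breadth_first=True, max_pages=1000):
    """Visit pages reachable from `seed`, breadth-first or depth-first."""
    frontier = deque([seed])
    visited = {seed}
    order = []                       # pages in the order they were crawled
    while frontier and len(order) < max_pages:
        # FIFO queue for breadth-first, LIFO stack for depth-first
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order
```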
In practice, however, neither is the Web graph a tree
nor can we collect all documents. Thus, if we know that we
will not be able to crawl all pages, we would like to crawl the
more important ones at least. Therefore, we expect a good
crawling strategy to visit more important pages sooner during
the crawl than a bad crawling strategy. Here, we only associate
a value of significance with each Web page and set the total
significance of all pages in the Web graph to be crawled to be
one. Then, at any time point of the crawl, we can plot the
importance value of all the pages crawled so far against the
fraction of the total number of pages to crawl.
Figure 5: Performance of a crawler sampling Web pages at
random.
In a crawl where pages are picked randomly (and uniformly) from the graph, the plot would be approximately diagonal, as in Figure 5. The diagonal line may
be considered as a baseline, and any crawler whose
performance curve plotted on the chart is above the diagonal
line is a more effective spider. Of course, normally we know
neither the total number of pages on the Web nor their
importance. Therefore, this measurement is possible for
synthetic (artificial) graphs, when the number of pages and
their importance are known before, or for pre-crawled Web
graphs with all the values required already computed. In both
cases, we call these “artificial” spiders crawling simulators.
Alternatively, we can measure crawling performance
retroactively and compute all the values when the crawl has
finished.
Baeza-Yates defines three groups of crawling strategies:
With no extra information. When deciding which page to
crawl next, the spider has no additional information
available except knowing the structure of the Web crawled
so far in the current crawl.
With historical information. The crawler additionally
knows the Web graph obtained in a recent “complete”
crawl. This is common for search engine spiders that
regularly crawl the Web in several-week intervals.
Typically, the spider knows what pages existed a couple of
weeks ago, what links they contained and what importance
the pages had which was computed after the crawl.
Although the Web changes very fast (about 25% of the links are newly created every week), the historical data were too costly to acquire to be entirely neglected. Thus, the selection of the next page to crawl will be based on the historical information.
With all information. This is a theoretical strategy not
usable in a real Web crawl. We will call it the omniscient
method, which perfectly knows the whole Web graph that
should be crawled including the values of importance of
individual pages. This method always chooses the page
with the highest importance from the frontier.
Crawling strategies with no extra information
Breadth-first. We mentioned this technique earlier. It is
reported to collect high quality (important) pages quite
soon. On the other hand, depth-first strategies are not much used in real Web crawling, also because the maximum crawling depth is harder to control with them.
Back link-count. Pages in the frontier with a higher number of in-links from pages already downloaded have a higher crawl priority (a small sketch of this strategy follows the list).
Batch-PageRank. This technique calculates PageRank
values for the pages in the frontier after downloading every
k pages. Of course, these PageRanks are based on the graph
constituted of the pages downloaded so far, and they are
only estimates of the real PageRanks derived from the
whole Web graph. After each re-calculation, the frontier is
prioritized according to the estimated PageRank and the top
k pages will be downloaded next.
Partial-PageRank. It is like Batch-PageRank but with
temporary PageRanks assigned to new pages until a new re-
calculation is done. These temporary PageRanks are
computed non-iteratively unlike normal PageRanks as the
sum of PageRanks of inlinking pages divided by the
number of out-links of those pages (the so-called out-link
normalization).
OPIC. This technique may be considered as a weighted
back link-count strategy.
Larger-sites-first. This method tries to cope best with the rule that Web sites must not be overloaded and preferentially chooses pages from Web sites having a large number of pending pages. The goal is not to be left at the end of the crawl with a small number of large sites, because that would slow down crawling due to the delay required between two accesses to the same site.
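As referenced under back link-count above, here is a minimal sketch of a frontier prioritized by that strategy. The class name is illustrative, and a production implementation would use a heap or an indexed priority queue rather than a linear scan for the maximum.

```python
class BacklinkFrontier:
    """Frontier prioritized by back link-count: pending pages with more in-links
    from pages already downloaded are crawled first."""

    def __init__(self):
        self.inlinks = {}   # pending URL -> number of in-links seen so far

    def add_link(self, target_url: str) -> None:
        """Register one more in-link, discovered while parsing a downloaded page."""
        self.inlinks[target_url] = self.inlinks.get(target_url, 0) + 1

    def pop_best(self):
        """Remove and return the pending URL with the highest back link-count."""
        if not self.inlinks:
            return None
        best = max(self.inlinks, key=self.inlinks.get)
        del self.inlinks[best]
        return best
```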
Crawling strategies with historical information
Again, we would like to order the pages in the frontier by
their PageRank and crawl the more important ones first. For
the pages encountered in the current crawl that existed when
the last crawl was run, we use their historical PageRank even
though we are aware that their current PageRank may have
changed. The pages that did not exist then have to be assigned
some estimates. There are several methods for dealing with these new pages:
Historical-PageRank-Omniscient. Again, it is a
theoretical variant which knows the complete graph and
assigns “true” PageRanks to the new pages.
Historical-PageRank-Random. It assigns to the new
pages random PageRanks chosen from those computed for
the previous crawl.
Historical-PageRank-Zero. New pages are all assigned a
zero PageRank and are thus crawled after “old” pages.
Historical-PageRank-Parent. Each new page is assigned
an out-link-normalized PageRank of its parent page(s)
linking to it. If a parent page is new as well (there is no
historical PageRank associated with it) we obviously
proceed to the grandparent and so forth.
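A small recursive sketch of the Historical-PageRank-Parent assignment. The dictionaries mapping URLs to their parents in the current crawl and to the previous crawl's PageRank and out-degree are assumed inputs, and the `seen` set merely guards against cycles among new pages.

```python
def historical_parent_rank(url, parents, old_rank, old_outdegree, seen=None):
    """Assign a new page the out-link-normalized historical PageRank of its
    parent page(s); if a parent is new as well, fall back to the grandparents."""
    seen = set() if seen is None else seen
    if url in seen:
        return 0.0
    seen.add(url)
    estimate = 0.0
    for parent in parents.get(url, []):
        if parent in old_rank:
            estimate += old_rank[parent] / max(old_outdegree.get(parent, 1), 1)
        else:
            # parent is new too: recurse to its own parents (the grandparents)
            estimate += historical_parent_rank(parent, parents, old_rank,
                                               old_outdegree, seen)
    return estimate
```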
Enhancing Web Data
Server log files provide domain types, time of access,
keywords and search engine used by visitors and can provide
some insight into how a visitor arrived at a website and what
keywords they used to locate it. Cookies dispensed from the
server can track browser visits and pages viewed and can
provide some insight into how often this visitor has been to the
site and what sections they wander into.
Mining Web Data
So far most analyses of web data have involved log
traffic reports, most of which provide cumulative accounts of
server activity but do not provide any true business insight
about customer demographics and online behavior. Most of the
current traffic analysis software, including NetIntellect, Bazaar
Analyzer Pro, HitList, NetTracker, Surf Report, WebTrends,
and others offer predefined reports about server activity based
on the analysis of log files. This basically limits the scope of
these tools to statistics about domain names, IP addresses,
cookies, browsers and other TCP/IP specific machine-to-
machine activity.
On the other hand, the mining of web data for an e-
commerce site yields visitor behavior analyses and profiles,
rather than server statistics. An e-commerce site needs to know
about the preferences and lifestyles of its visitors. Data mining
in this context is about addressing business questions such as: who is buying what items, and at what rates? You also would
like to know what is selling so you can adjust your inventory
and plan your orders and shipping. You need to know how to
sell and what incentives, offers and ads work, and how you
should design your site to optimize your profits.
Ten Steps to Mining Web Data
1. Plan Your Project: Identify Your Objective
The mining of a website involves some advance planning about what type and level of information you intend to capture at the server and what additional data you plan to match it with. This by itself will help ensure that the data mining efforts yield measurable business results. For example, you need to plan with the web team what kind of log, cookie and form information you intend to capture, and at what juncture, from the visitors.
2. Select Your Data
Once the business objective has been defined, we must select the web server and company data needed to meet this goal.
3. Prepare the Data
Once the data has been assembled and visually inspected,
we must decide which attributes to exclude and which
attributes need to be converted into usable formats.
4. Evaluate the Data
We should evaluate the data's structure to determine what
type of data mining tools to use for the analysis.
5. Format the Solution
There are a number of web mining formats or solutions. Having evaluated the web data and set the business objectives, we must select the format of the e-commerce solution.
6. Select the Tools
To choose the right mining tool, not only select the right technology but also consider the characteristics and structure of the data: the number of continuous-value fields, the number of dependent variables, the number of categorical fields, the length and type of records, and the "skewness" of the data set.
7. Construct the Models
It is not until this stage that we actually begin mining the web site files. During the mining process, we search for patterns in the data sets, generate classification rules, decision trees, clusters, scores and weights, and evaluate and compare error rates.
8. Validate the Findings
A data mining analysis of the web site will most likely
involve individuals from several departments, such as
information systems, marketing, sales, inventory, etc. It
most definitely will involve the administrators, designers, analysts and managers.
9. Deliver the Findings
A report should be prepared documenting the entire web mining process, including the steps taken in selecting and preparing the data, the tools used and why, the tool settings, the findings, and an explanation of any code that was generated.
10. Integrate the Solutions
This process involves incorporating the findings into your
firm's business practices, marketing efforts, and strategic
planning. Web mining is a pattern-recognition process
involving hundreds, thousands or maybe millions of daily
transactions in your web site.
Conclusion
Web crawling is more than a trivial graph traversal problem. It involves several issues that arise from the
distributed nature of the Web. First, Web crawlers must share
resources with other agents, mostly humans, and cannot monopolize Web sites' time; indeed, a Web crawler should try to minimize its impact on Web sites. Second, Web crawlers
have to deal with an information repository which contains
many objects of varying quality, including objects with very
low quality created to lure the Web crawler and deceive
ranking schemes.
While the model implies that all the portions of the
search engine should know all the properties of the Web pages,
the architecture introduced in this survey is an attempt to separate these properties into smaller units (text, link graph,
etc.) for better scalability.
Using Data mining solutions such as association,
segmentation, clustering, classification (prediction),
visualization and optimization and a data mining tool
incorporating a machine-learning algorithm, a website database
can be segmented into unique groups of visitors each with
individual behavior. Based on research studies and feedback provided by the data processing group, we gather a set of data that should be stored and some possible queries that other researchers may be interested in. We recommend a data schema to be used for storing Internet data, as well as a possible processing order for data loading.
References
1. Baeza-Yates R., Castillo C. Crawling the infinite Web: five
levels are enough. Proceedings
of the third Workshop on Web Graphs (WAW), Rome, Italy,
Lecture Notes in Computer Science, Springer, vol. 3243, pp.
156-167, 2004.
2. Baeza-Yates R., Castillo C., Marín M., Rodríguez A.
Crawling a country: better strategies than breadth-first for
web page ordering. Proceedings of the 14th international
conference on World Wide Web (WWW 2005), Chiba, Japan,
pp. 864-872, 2005.
3. Boldi P., Codenotti B., Santini M., Vigna S. UbiCrawler: a
scalable fully distributed Web crawler. Software Practice and
Experience, vol. 34, no. 8, pp.711-726, 2004.
4. Brin S., Page L. The Anatomy of a Large-Scale Hypertextual
Web Search Engine. Proceedings of the 7th World Wide Web
Conference, pp. 107-117, 1998.
5. Bröder A., Kumar R., Maghoul F., Raghavan P.,
Rajagopalan S., Stata R., Tomkins A., Wiener J. Graph
structure in the Web. Computer Networks, vol. 33, no. 1-6, pp. 309-320, 2000.
6. Chakrabarti S., Dom B. E., Gibson D., Kumar R., Raghavan
P., Rajagopalan S., Tomkins A. Spectral Filtering for Resource
Discovery. Proceedings of the ACM SIGIR Workshop on
Hypertext Information Retrieval on the Web, Melbourne,
Australia, pp. 13-21, 1998.
7. Chakrabarti S. Mining the Web: Analysis of Hypertext and
Semi Structured Data. Morgan Kaufmann Publishers, San
Francisco, California, USA, 2002.
8. Cho J, Shivakumar N., Garcia-Molina H. Finding Replicated
Web Collections. Proceedings of the 2000 ACM SIGMOD
International Conference on Management of Data, Dallas,
Texas, USA, pp. 355-366, 2000.
9. Cho J., Garcia-Molina H. Parallel crawlers. Proceedings of
the 11th international conference on the World Wide Web
(WWW '02), Honolulu, Hawaii, USA, pp. 124-135, 2002.