Architecture of a WebCrawler
N. K. Kameswara Rao a, Dr. G. P. Saradhi Varma b
a Professor, Department of Computer Science, Pragati Engineering College, India, nkkamesh@gmail.com
b Professor& HOD, Department of IT, SRKR Engineering College, India, gpsvarma@yahoo.com
Abstract
WebCrawler is the comprehensive full-text search engine for the World-Wide Web. Its invention and subsequent evolution helped the Web's growth by creating a new way of navigating hypertext. WebCrawler assists users in their Web navigation by automating the task of link traversal, creating a searchable index of the Web and fulfilling searchers' queries from the index. The Web is a context in which traditional Information Retrieval methods are challenged, given the volume of the Web and its speed of change. Server log files provide domain types, time of access, keywords and the search engine used by visitors, and can provide some insight into how a visitor arrived at a website and what keywords they used to locate it. Data mining solutions come in many types, such as association, segmentation, clustering, classification (prediction), visualization and optimization. Using a data mining tool incorporating a machine-learning algorithm, a website database can be segmented into unique groups of visitors, each with individual behavior. Based on research studies and feedback provided by the data processing group, we gather a set of data that should be stored and some possible queries that other researchers may be interested in.
Introduction
Web crawlers are programs that exploit the graph
structure of the Web to move from page to page. In their
infancy such programs were also called wanderers, robots,
spiders, fish, and worms, words that are quite evocative of
Web imagery. It may be observed that the noun 'crawler' is not
indicative of the speed of these programs, as they can be
considerably fast.
From the beginning, a key motivation for designing
Web crawlers has been to retrieve Web pages and add them or
their representations to a local repository. Such a repository
may then serve particular application needs such as those of a
Web search engine. In its simplest form a crawler starts from a
seed page and then uses the external links within it to visit
other pages. The process repeats with the new pages offering
more external links to follow, until a sufficient number of
pages are identified or some higher level objective is reached.
Behind this simple description lies a host of issues related to
network connections, spider traps, canonicalizing URLs,
parsing HTML pages, and the ethics of dealing with remote
Web servers.
Were the Web a static collection of pages we would
have little long term use for crawling. Once all the pages had
been fetched to a repository (like a search engine's database),
there would be no further need for crawling. However, the
Web is a dynamic entity with subspaces evolving at differing
and often rapid rates. Hence there is a continual need for
crawlers to help applications stay current as new pages are
added and old ones are deleted, moved or modified.
General purpose search engines serving as entry
points to Web pages strive for coverage that is as broad as
possible. They use Web crawlers to maintain their index
databases amortizing the cost of crawling and indexing over
the millions of queries received by them. These crawlers are
blind and exhaustive in their approach, with
comprehensiveness as their major goal. In contrast, crawlers
can be selective about the pages they fetch and are then
referred to as preferential or heuristic-based crawlers. These
may be used for building focused repositories, automating
resource discovery, and facilitating software agents.
Preferential crawlers built to retrieve pages within a certain
topic are called topical or focused crawlers.
There are several dimensions about topical crawlers
that make them an exciting object of study. One key question
that has motivated much research is: How is crawler selectivity
to be achieved? Rich contextual aspects such as the goals of
the parent application, lexical signals within the Web pages
and also features of the graph built from pages already seen -
these are all reasonable kinds of evidence to exploit.
Additionally, crawlers can and often do differ in their
mechanisms for using the evidence available to them.
A second major aspect that is important to consider
when studying crawlers, especially topical crawlers, is the
nature of the crawl task. Crawl characteristics such as queries
and/or keywords provided as input criteria to the crawler, user-
profiles, and desired properties of the pages to be fetched
(similar pages, popular pages, authoritative pages etc.) can lead
to significant differences in crawler design and
implementation. The task could be constrained by parameters
like the maximum number of pages to be fetched (long crawls
vs. short crawls) or the available memory. Hence, a crawling
task can be viewed as a constrained multi-objective search
problem. However, the wide variety of objective functions,
coupled with the lack of appropriate knowledge about the
search space, makes the problem a hard one. Furthermore, a
crawler may have to deal with optimization issues such as local
vs. global optima.
The last key dimension is regarding crawler
evaluation strategies necessary to make comparisons and
determine circumstances under which one or the other crawlers
work best. Comparisons must be fair and made with an eye
towards drawing out statistically significant differences. Not
only does this require a sufficient number of crawl runs but
also sound methodologies that consider the temporal nature of
crawler outputs. Significant challenges in evaluation include
the general unavailability of relevant sets for particular topics
or queries. Thus evaluation typically relies on defining
measures for estimating page importance.
Web crawling issues
There are two important characteristics of the Web
that generate a scenario in which Web crawling is very
difficult: its large volume and its rate of change, as there is a
huge amount of pages being added, changed and removed
every day. Also, network speed has improved less than current
processing speeds and storage capacities. The large volume
implies that the crawler can only download a fraction of the
Web pages within a given time. The high rate of change
implies that by the time the crawler is downloading the last
pages from a site, it is very likely that new pages have been
added to the site, or that pages have already been updated or even deleted.
Crawling the Web in a certain way resembles watching the sky on a clear night: what we see reflects the state of the stars at different times, as their light has travelled different distances. What a Web crawler obtains is not a "snapshot" of the Web, because it does not represent the Web at any given instant of time. The last pages being crawled are probably very accurately represented, but the first pages that were downloaded have a high probability of having been changed. This idea is depicted in Figure 1. As Edwards et al. note, "Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained." A crawler must therefore carefully choose at each step which pages to visit next.
Figure 1: Search Engine View
As Figure 1 illustrates, because the crawling process takes time and the Web is very dynamic, a search engine's view of the Web represents the state of Web pages at different times. This is similar to watching the sky at night: the stars we see never existed simultaneously as we see them.
The behavior of a Web crawler is the outcome of a
combination of policies:
• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the
pages.
• A politeness policy that states how to avoid overloading Web
sites.
• A parallelization policy that states how to coordinate distributed Web crawlers.
1. Selection policy
Given the current size of the Web, even large search
engines cover only a portion of the publicly available content;
a study by Lawrence and Giles showed that no search engine
indexes more than 16% of the Web. As a crawler always
downloads just a fraction of the Web pages, it is highly
desirable that the downloaded fraction contains the most
relevant pages, and not just a random sample of the Web. This
requires a metric of importance for prioritizing Web pages.
The importance of a page is a function of its intrinsic quality,
its popularity in terms of links or visits, and even of its URL.
Designing a good selection policy has an added difficulty: it
must work with partial information, as the complete set of Web
pages is not known during crawling.
2. Re-visit policy
The Web has a very dynamic nature, and crawling a
fraction of the Web can take a long time, usually measured in
weeks or months. By the time a Web crawler has finished its
crawl, many events could have happened. We characterize these events as creations, updates and deletions.
Creations. When a page is created, it will not be visible on the public Web space until it is linked, so we assume that at least one page update (adding a link to the new Web page) must occur for a Web page creation to become visible. A Web crawler starts with a set of starting URLs, usually a list of domain names, so registering a domain name can be seen as the act of creating a URL. Also, under some schemes of cooperation the Web server could provide a list of URLs without the need of a link.
Updates. Page changes are difficult to characterize: an update can be either minor or major. An update is minor if it is at the paragraph or sentence level, so the page is semantically almost the same and references to its content are still valid. On the contrary, in the case of a major update, all references to its content are no longer valid. It is customary to consider any update as major, as it is difficult to judge automatically whether the page's content is semantically the same.
Deletions. A page is deleted if it is removed from the public Web, or if all the links to that page are removed. Note that even if all the links to a page are removed, so that the page is no longer visible in the Web site, it may still be visible to the Web crawler. It is almost impossible to detect that a page has lost all of its links, as the Web crawler can never tell whether links to the target page are absent or only present in pages that have not been crawled.
Cost functions. From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most used cost functions are freshness and age.
Freshness. This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:
F_p(t) = 1 if p is equal to the local copy at time t, and 0 otherwise.
Age. This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:
A_p(t) = 0 if p is not modified at time t, and t − (modification time of p) otherwise.
The evolution of these two quantities is depicted in Figure 2.
Figure-2 The Evolution of Freshness and Age
Figure 2 shows the evolution of freshness and age over time. Two types of events can occur: modification of a Web page on the server (event "modify") and downloading of the modified page by the crawler (event "sync").
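To make these two cost functions concrete, the following is a minimal Python sketch. The PageCopy record and its fields (the last known modification time and a flag marking whether the local copy still matches the server) are illustrative assumptions rather than part of the original formulation.

```python
from dataclasses import dataclass

@dataclass
class PageCopy:
    """Local copy of a page p, with the bookkeeping needed for the cost functions."""
    last_modified: float         # last known modification time of p on the server
    identical_to_server: bool    # does the local copy still match the server copy?

def freshness(p: PageCopy) -> int:
    """F_p(t): 1 if p is equal to the local copy at time t, 0 otherwise."""
    return 1 if p.identical_to_server else 0

def age(p: PageCopy, t: float) -> float:
    """A_p(t): 0 while p is unmodified, otherwise the time elapsed since the
    (unsynchronized) modification of p."""
    return 0.0 if p.identical_to_server else t - p.last_modified
```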
Explicit formulas for the re-visit policy are not attainable in general; in practice they are obtained numerically, as they depend on the distribution of page changes. Note that the re-visiting policies considered here regard all pages as homogeneous in terms of quality (all pages on the Web are worth the same), which is not a realistic scenario, so further information about Web page quality should be included to achieve a better crawling policy.
3. Politeness policy
As noted by Koster, the use of Web robots is useful
for a number of tasks, but comes with a price for the general
community. The costs of using Web robots include:
• Network resources, as robots require considerable bandwidth,
and operate with a high degree of parallelism during a long
period of time.
• Server overload, especially if the frequency of accesses to a given server is too high.
• Poorly written robots, which can crash servers or routers, or
which download pages they cannot handle.
• Personal robots that, if deployed by too many users, can
disrupt networks and Web servers.
A partial solution to these problems is the robots exclusion protocol, a standard that lets administrators indicate which parts of their Web servers should not be accessed by robots. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload.
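As a small illustration of this check, the sketch below uses Python's standard urllib.robotparser module; the user-agent name "ExampleCrawler" is a placeholder, and a real crawler would cache the parsed robots.txt per host instead of re-reading it for every URL.

```python
from urllib import robotparser
from urllib.parse import urlparse, urlunparse

def allowed_to_fetch(url: str, user_agent: str = "ExampleCrawler") -> bool:
    """Consult the robots exclusion protocol before fetching a URL."""
    parts = urlparse(url)
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # downloads and parses robots.txt
    return rp.can_fetch(user_agent, url)
```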
It is worth noticing that even when being very polite,
and taking all the safeguards to avoid overloading Web
servers, some complaints from Web server administrators are
received. Brin and Page note that: “... running a crawler which
connects to more than half a million servers (...) generates a
fair amount of email and phone calls. Because of the vast
number of people coming on line, there are always those who
do not know what a crawler is, because this is the first one they
have seen.”
4. Parallelization policy
A parallel crawler is a crawler that runs multiple
processes in parallel. The goal is to maximize the download
rate while minimizing the overhead from parallelization and to
avoid repeated downloads of the same page.
To avoid downloading the same page more than once, the
crawling system requires a policy for assigning the new URLs
discovered during the crawling process, as the same URL can
be found by two different crawling processes. Cho and Garcia-
Molina studied two types of policy:
Dynamic assignment. With this type of policy, a
central server assigns new URLs to different crawlers
dynamically. This allows the central server to, for instance,
dynamically balance the load of each crawler.
With dynamic assignment, typically the systems can
also add or remove downloader processes. The central server
may become the bottleneck, so most of the workload must be
transferred to the distributed crawling processes for large
crawls.
There are two configurations of crawling architectures with
dynamic assignment that have been described by Shkapenyuk
and Suel:
• A small crawler configuration, in which there is a central
DNS resolver and central queues per Web site, and distributed
downloaders.
• A large crawler configuration, in which the DNS resolver and
the queues are also distributed.
Static assignment. With this type of policy, there is a fixed rule, stated from the beginning of the crawl, that defines how to assign new URLs to the crawlers.
For static assignment, a hashing function can be used
to transform URLs (or, even better, complete Web site names)
into a number that corresponds to the index of the
corresponding crawling process. As there are external links
that will go from a Web site assigned to one crawling process
to a Web site assigned to a different crawling process, some
exchange of URLs must occur.
To reduce the overhead due to the exchange of URLs
between crawling processes, the exchange should be done in
batch, several URLs at a time, and the most cited URLs in the
collection should be known by all crawling processes before
the crawl (e.g., using data from a previous crawl).
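A minimal sketch of static assignment, assuming MD5 as the hashing function and hashing the complete Web site name (the host) rather than the full URL, so that all pages of a site stay with one crawling process; the function name is illustrative.

```python
import hashlib
from urllib.parse import urlparse

def assign_crawler(url: str, num_processes: int) -> int:
    """Static assignment: map a URL to the index of a crawling process by
    hashing its host name. Keeping a whole site in one process also lets that
    process enforce politeness constraints locally."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processes
```

URLs discovered by one process but assigned to another would then be buffered and exchanged in batches, as described above.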
Architecture of a Web Crawler
Figure 3 depicts the typical architecture of a large-
scale Web crawler. By a large-scale crawler we mean a system
capable of gathering billions of documents from the current
World Wide Web. It is clear that with such a huge amount of
data more sophisticated techniques must be applied than
simply parsing HTML files and downloading documents from
the URLs extracted from them. As we can see in the figure, much attention is paid to the problems of avoiding Web pages (URLs) that have already been visited, parallelizing the crawl (fetching threads), balancing the load of the Web servers from which documents are obtained (server queues), and speeding up access to Web servers (via DNS caching).
Figure 3: Architecture of a typical Web crawler
The role of Web crawling
Although Web crawling is at the heart of every Web search engine, only rather general architectural descriptions of crawlers, lacking important details, have been published so far. Commercial search engines treat their Web crawling techniques as business secrets and prefer not to give their rivals a chance to take advantage of their know-how. Another reason is to keep essential information on crawling away from search engine spammers who would abuse it. Among the crawler architectures that have been published are that of Alexa, which is still the Web robot of the Internet Archive; an early version of Googlebot, the crawler of Google; Mercator, which was the spider of AltaVista; UbiCrawler; and Dominos.
In general, a Web crawler takes a URL from the queue of pending URLs, downloads the page at that URL, stores the document in a repository, and parses its text to find hyperlinks to further URLs, which it then enqueues in the queue of pending URLs if they have not yet been downloaded (“fetched”). Ideally, crawling is stopped when the
queue of pending URLs is empty. In practice, however, this
will never happen as the universe of a large-scale Web crawler
is almost infinite. The Web is steadily changing and will never
be crawled as a whole. So a reasonable terminating condition
must be set up for the crawler to stop. For example, a certain
number of documents have been fetched, a specific number of
terabytes of data has been downloaded, a particular time period
has elapsed, or the crawler simply runs out of resources (main
memory, storage capacities, etc.).
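The following is a minimal, purely sequential Python sketch of this loop, using only the standard library. The seed list, the max_pages terminating condition and the in-memory repository are illustrative simplifications of what a large-scale crawler would actually use.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href attributes of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    """Fetch, store, parse, and enqueue unseen URLs until a bound is reached."""
    frontier = deque(seeds)             # queue of pending URLs
    visited = set(seeds)
    repository = {}                     # url -> document
    while frontier and len(repository) < max_pages:   # terminating condition
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                    # skip unreachable or non-decodable pages
        repository[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in visited:
                visited.add(absolute)
                frontier.append(absolute)
    return repository
```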
Internals
More specifically, a Web spider would like to do
many activities in parallel in order to speed up the process of
crawling. In fact, DNS name resolution, i.e. getting the IP address of an Internet host by contacting specific servers with name-to-IP mappings, and opening an HTTP connection to a Web server may take up to a second, which is often more than receiving the response from a Web server (i.e. downloading a
small or middle-sized document with a sufficiently fast
connection). So the natural idea is to fetch many documents at
a time.
Current commercial large-scale Web robots fetch up
to several thousands of documents in parallel and crawl the
“whole” Web (billions of documents) within a couple of
weeks. Interestingly, parallelization objects offered by
operating systems such as processes and threads do not seem
advantageous for multiple fetching of thousands of documents
due to thread (process) synchronization overheads. Instead, a
non-blocking fetching via asynchronous sockets is preferred.
Indeed, present commercial search engines work with such
huge amounts of data that they have to use technologies that
are often beyond capabilities of traditional operating systems.
Google, for example, has a file system of its own.
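A minimal sketch of fetching many documents at a time without one thread per connection. It uses Python's asyncio event loop together with the third-party aiohttp client for brevity, whereas the crawlers described above work directly with non-blocking sockets; the concurrency limit and time-out values are illustrative.

```python
import asyncio
import aiohttp  # third-party asynchronous HTTP client, used here for brevity

async def fetch(session, url):
    """Download one page without blocking the event loop."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return url, await resp.text()
    except Exception:
        return url, None

async def fetch_many(urls, concurrency=100):
    """Keep up to `concurrency` requests in flight at the same time."""
    semaphore = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        async def guarded(url):
            async with semaphore:
                return await fetch(session, url)
        return await asyncio.gather(*(guarded(u) for u in urls))

# results = asyncio.run(fetch_many(["http://example.com/"]))
```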
Implementors of large-scale Web crawlers try to
reduce the host name resolution time by means of DNS
caching. The DNS server mapping host names to their IP
addresses is customized and extended with a DNS cache and a
prefetching client. The cache is preferably placed in the main
memory for a very fast lookup in the table of names and IPs. In
this way, server names that have already been put in the cache
before can be found almost immediately.
New names, though, have still to be searched for on
distant DNS servers. Therefore, the prefetching client sends
requests to the DNS server right after URL extraction from a
downloaded page and does not wait until the resolution
terminates (non-blocking UDP datagrams are sent). Thus, the
cache is filled up with corresponding IPs long before they are
actually needed. (DNS requests are kept completely away from
a common Web surfer. It is the Web browser that gets all the
work done.)
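A toy sketch of the idea, assuming the operating system's resolver and a thread pool in place of a customized DNS server. prefetch() would be called right after URL extraction so that the cache is usually warm by the time the fetcher needs the address; all names are illustrative.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

class DnsCache:
    """In-memory host-name-to-IP cache with non-blocking prefetching."""

    def __init__(self, workers: int = 16):
        self._cache = {}                                   # host -> IP address
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def prefetch(self, host: str) -> None:
        """Issue the lookup right after URL extraction, without waiting for it."""
        if host not in self._cache:
            self._pool.submit(self._resolve, host)

    def _resolve(self, host: str) -> None:
        try:
            self._cache[host] = socket.gethostbyname(host)
        except OSError:
            pass                                           # leave unresolved hosts out

    def lookup(self, host: str) -> str:
        """Return the cached IP, resolving synchronously only on a cache miss."""
        if host not in self._cache:
            self._resolve(host)
        return self._cache.get(host, host)
```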
Avoiding redundancy
The biggest task of a crawler is to avoid redundancy by
eliminating duplicate pages and links from the crawl. A
crawler that does not respect this may easily end up in a spider trap: an infinite loop of links between the same pages. Such a
trapped spider can “crawl” the Web for ages and collect
petabytes of data, but it will be useless, because it gets stuck in
just one place of the Web. There must be a module
(isUrlVisited?) that checks whether or not a page has been
already fetched before putting its URL to the working pool of
pending documents (sometimes called frontier). The intuitive
solution is to have a list of URLs already visited and to
compare each newly extracted URL against this list.
Unfortunately, many problems arise here:
Different forms of URLs. URLs occur in various forms. They may be absolute or relative, they may or may not include port numbers, fragments, or queries that may contain special or even non-Latin characters, they may be in lower case or upper case, etc. Before we can attempt to compare URLs, we have to normalize them and produce the so-called canonical form. In this form, every URL is absolute, with the host name in lower case, without non-Latin characters and so on.
Too many URLs. To crawl a significant portion of the
Web, we would need to store somewhere a few billion URLs for further comparisons. Imagine that an average normalized URL is fifty characters long. Even for a one-billion-page crawl, a storage capacity of 50 billion bytes (50 GB) would be required. Moreover, access to the list of URLs visited must be very fast, as the check will be very frequent. How can we resolve this difficulty? We can somewhat reduce the size of URLs by encoding them into MD5 fingerprints or CRC checksums. These fingerprints may be four to eight bytes long, depending on how many URLs we expect to crawl. In addition, we can use each fingerprint as a hash and store the URLs in a hash table on disk. Disk seeks will still be slow, but we can improve this with two-level hashing: host name hashing and path hashing are done separately for each URL.
Duplicate pages with different URLs. Even if we are
careful enough and never crawl the same URL twice, we
can still download pages with the same content if they
have different URLs. In order to avoid adding links to the frontier that appear new, because they are relative to a page with a different URL but duplicate content, yet in reality have been added before, it is necessary to check for each newly fetched page whether its content has already been downloaded (the isPageKnown? module in Figure 3).
Again, we can use the MD5 hash function here. We will maintain a list of fingerprints of fetched pages' contents and compare each new page against it. Unfortunately, even a very small difference between two pages that are otherwise considered duplicates, such as a different time stamp at the bottom of the page, results in distinct fingerprints, and duplicate recognition fails. Thus, the process must be enhanced with a technique called shingling, which detects near-duplicates. A small sketch of these checks follows.
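The sketch below illustrates the three checks discussed above: URL canonicalization, MD5 fingerprinting of URLs, and shingling-based near-duplicate detection. The normalization rules, the word-level shingles and the Jaccard-similarity threshold are illustrative simplifications.

```python
import hashlib
from urllib.parse import urljoin, urlparse, urlunparse

def canonicalize(url: str, base: str = "") -> str:
    """Produce a canonical URL: absolute, host name in lower case, default HTTP
    port and fragment removed. (A production crawler normalizes much more.)"""
    parts = urlparse(urljoin(base, url))
    host = parts.netloc.lower()
    if parts.scheme == "http" and host.endswith(":80"):
        host = host[:-3]
    return urlunparse((parts.scheme, host, parts.path or "/",
                       parts.params, parts.query, ""))     # drop the fragment

def url_fingerprint(url: str) -> bytes:
    """Eight-byte MD5 fingerprint of a canonical URL (isUrlVisited? check)."""
    return hashlib.md5(url.encode("utf-8")).digest()[:8]

def shingles(text: str, k: int = 4) -> set:
    """Hashed k-word shingles of a page's text (isPageKnown? check)."""
    words = text.split()
    return {hash(tuple(words[i:i + k])) for i in range(len(words) - k + 1)}

def near_duplicates(a: str, b: str, threshold: float = 0.9) -> bool:
    """Pages are near-duplicates if the Jaccard similarity of their shingle sets
    exceeds the threshold, so a changed time stamp no longer defeats detection."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return a == b
    return len(sa & sb) / len(sa | sb) >= threshold
```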
Care must be taken not to overload Web servers with requests. Not only does this prevent a denial of service, but it is also a measure of politeness towards other Web users. Ideally, the load monitor and manager distribute requests evenly among servers, for each of which there is a queue of pending URLs. They ensure that the interval between two requests sent to the same server is no less than, say, a minute. Among other benefits, fetching pages uniformly from distinct servers reduces the risk of getting stuck in a spider trap.
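A minimal sketch of such per-server scheduling; the one-minute minimum interval follows the example above, and the class and method names are illustrative.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteScheduler:
    """One queue of pending URLs per server; a URL is handed out only when the
    minimum interval since the last request to that server has elapsed."""

    def __init__(self, min_interval: float = 60.0):
        self.min_interval = min_interval
        self.queues = defaultdict(deque)    # host -> pending URLs
        self.last_request = {}              # host -> time of last request

    def add(self, url: str) -> None:
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        """Return a URL whose server may be contacted now, or None."""
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now - self.last_request.get(host, 0.0) >= self.min_interval:
                self.last_request[host] = now
                return queue.popleft()
        return None
```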
Dynamic pages
We have seen that the greatest danger for a Web
crawler consists in not recognizing that a Web page has
already been fetched before. If this happens, the spider may
easily crawl a very small part of the Web infinitely long. The
main sources of such difficulties are page duplication and site
mirroring (i.e. duplication of whole Web sites), dynamically
generated pages and Web host aliases. A computer with a
certain IP address may be represented by one or more host
names (virtual servers). On the other hand, a Web site may be
hosted by several machines with distinct IPs. This many-to-
many relation between host names and IPs along with aliases
(synonymous names of a Web site) makes the recognition of
known URLs even more difficult. Besides shingling for
duplicate pages, there exist techniques for the detection of
mirrored Web sites that may help resolve this problem as well.
But by far the biggest trouble is with dynamic pages, such as those generated by CGI, PHP, or JavaScript.
Dynamic pages are dangerous in that they can
generate whatever content (including what we are not at all
interested in), and that their number may be virtually infinitely
large. Dynamic pages often contain generated URLs that differ
only in one parameter of their query part. Also, they are often
results of a database query depending on what the Web user
types in a Web form, etc. It is not feasible to store fingerprints of either their URLs or their contents because of their immense number. How can we overcome this problem? The most robust spider would simply ignore dynamic pages. However, it would probably miss a lot of important data. There have even been attempts to crawl the hidden Web behind Web forms. In practice, we must still observe crawling statistics and set bounds for various parameters, such as the number of documents gathered on a site or the crawling depth (i.e. the number of links followed to reach the current page). Whenever a bound is exceeded, crawling as a whole, or just on that particular site, is stopped. For example, Baeza-Yates recommends a maximum crawling depth of five for static pages and fifteen for dynamic pages.
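A tiny sketch of such bound checking before a URL is added to the frontier. The per-site page limit and the heuristic used to classify a URL as dynamic are illustrative assumptions; the depth limits follow the figures quoted above.

```python
from urllib.parse import urlparse

MAX_DEPTH = {"static": 5, "dynamic": 15}   # depth bounds quoted above
MAX_PAGES_PER_SITE = 10_000                # illustrative per-site document bound

def should_enqueue(url: str, depth: int, pages_per_site: dict) -> bool:
    """Reject URLs that exceed the crawling-depth bound for their page type or
    whose site has already yielded too many documents."""
    dynamic = "?" in url or url.endswith((".php", ".cgi"))   # crude heuristic
    if depth > MAX_DEPTH["dynamic" if dynamic else "static"]:
        return False
    host = urlparse(url).netloc
    return pages_per_site.get(host, 0) < MAX_PAGES_PER_SITE
```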
Crawling strategies
Assume for simplicity that we are to crawl a small
part of the Web that is a tree. Because we are sure that this part
of the Web is finite and that we are going to visit all of its
pages, we can arbitrarily choose one of the two basic crawling methods: breadth-first or depth-first crawling. Let us recall
that with breadth-first crawling, we first visit nodes with the
same distance (number of links) from the root node. The data
structure used here to store links extracted from pages is a
queue. On the other hand, in depth-first crawling, we follow
links as
deep as we can. We put them on a stack. See Figure 4 for a
small example. Which of the two strategies is better? In this
simple case, they are the same provided we are not interested
in the order of visiting individual pages. At the end, we will
have a set of documents which we can, for example, add to a
corpus and build an index on it.
Figure 4: Breadth-first (left) and depth-first (right) crawling of a
simple Web tree.
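Both strategies can be written as one function that differs only in which end of the link container the next URL is taken from; get_links is an assumed helper that fetches a page and returns its out-links, and the deque acts as a FIFO queue for breadth-first and as a stack for depth-first crawling.

```python
from collections import deque

def traverse(seed, get_links, breadth_first=True, max_pages=1000):
    """Visit pages reachable from `seed`, breadth-first or depth-first."""
    frontier = deque([seed])
    visited = {seed}
    order = []                       # pages in the order they were crawled
    while frontier and len(order) < max_pages:
        # FIFO queue for breadth-first, LIFO stack for depth-first
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order
```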
In practice, however, neither is the Web graph a tree
nor can we collect all documents. Thus, if we know that we
will not be able to crawl all pages, we would like to crawl the
more important ones at least. Therefore, we expect a good
crawling strategy to visit more important pages sooner during
the crawl than a bad crawling strategy. Here, we only associate
a value of significance with each Web page and set the total
significance of all pages in the Web graph to be crawled to be
one. Then, at any time point of the crawl, we can plot the
importance value of all the pages crawled so far against the
fraction of the total number of pages to crawl.
Figure 5: Performance of a crawler sampling Web pages at
random.
In a crawl where pages are picked randomly (and uniformly) from the graph, the plot would be approximately diagonal, as in Figure 5. The diagonal line may
be considered as a baseline, and any crawler whose
performance curve plotted on the chart is above the diagonal
line is a more effective spider. Of course, normally we know
neither the total number of pages on the Web nor their
importance. Therefore, this measurement is possible for
synthetic (artificial) graphs, when the number of pages and
their importance are known before, or for pre-crawled Web
graphs with all the values required already computed. In both
cases, we call these “artificial” spiders crawling simulators.
Alternatively, we can measure crawling performance
retroactively and compute all the values when the crawl has
finished.
Baeza-Yates defines three groups of crawling strategies:
With no extra information. When deciding which page to
crawl next, the spider has no additional information
available except knowing the structure of the Web crawled
so far in the current crawl.
With historical information. The crawler additionally
knows the Web graph obtained in a recent “complete”
crawl. This is common for search engine spiders that
regularly crawl the Web in several-week intervals.
Typically, the spider knows what pages existed a couple of
weeks ago, what links they contained and what importance
the pages had which was computed after the crawl.
Although the Web changes very fast (about 25% of the links are newly created every week), the historical data were too costly to acquire to be entirely neglected. Thus, the selection of the next page to crawl will be based on the historical information.
With all information. This is a theoretical strategy not
usable in a real Web crawl. We will call it the omniscient
method, which perfectly knows the whole Web graph that
should be crawled including the values of importance of
individual pages. This method always chooses the page
with the highest importance from the frontier.
Crawling strategies with no extra information
Breadth-first. We mentioned this technique earlier. It is
reported to collect high quality (important) pages quite
soon. On the other hand, depth-first strategies are not much used in real Web crawling, also because the maximum crawling depth is harder to control with them.
Back link-count. Pages in the frontier with a higher number of in-links from pages already downloaded have a higher crawl priority (a small sketch of this strategy follows the list).
Batch-PageRank. This technique calculates PageRank
values for the pages in the frontier after downloading every
k pages. Of course, these PageRanks are based on the graph
constituted of the pages downloaded so far, and they are
only estimates of the real PageRanks derived from the
whole Web graph. After each re-calculation, the frontier is
prioritized according to the estimated PageRank and the top
k pages will be downloaded next.
Partial-PageRank. It is like Batch-PageRank but with
temporary PageRanks assigned to new pages until a new re-
calculation is done. These temporary PageRanks are
computed non-iteratively unlike normal PageRanks as the
sum of PageRanks of inlinking pages divided by the
number of out-links of those pages (the so-called out-link
normalization).
OPIC. This technique may be considered as a weighted
back link-count strategy.
Larger-sites-first. This method tries to cope best with the rule that Web sites must not be overloaded and preferentially chooses pages from Web sites having a large number of pending pages. The goal is not to be left at the end of the crawl with a small number of large sites, because that would slow down crawling due to the delay required between two accesses to the same site.
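As referenced under back link-count above, here is a minimal sketch of a frontier prioritized by that strategy. The class name is illustrative, and a production implementation would use a heap or an indexed priority queue rather than a linear scan for the maximum.

```python
class BacklinkFrontier:
    """Frontier prioritized by back link-count: pending pages with more in-links
    from pages already downloaded are crawled first."""

    def __init__(self):
        self.inlinks = {}   # pending URL -> number of in-links seen so far

    def add_link(self, target_url: str) -> None:
        """Register one more in-link, discovered while parsing a downloaded page."""
        self.inlinks[target_url] = self.inlinks.get(target_url, 0) + 1

    def pop_best(self):
        """Remove and return the pending URL with the highest back link-count."""
        if not self.inlinks:
            return None
        best = max(self.inlinks, key=self.inlinks.get)
        del self.inlinks[best]
        return best
```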
Crawling strategies with historical information
Again, we would like to order the pages in the frontier by
their PageRank and crawl the more important ones first. For
the pages encountered in the current crawl that existed when
the last crawl was run, we use their historical PageRank even
though we are aware that their current PageRank may have
changed. The pages that did not exist then have to be assigned
some estimates. There are several methods for dealing with these new pages:
Historical-PageRank-Omniscient. Again, it is a
theoretical variant which knows the complete graph and
assigns “true” PageRanks to the new pages.
Historical-PageRank-Random. It assigns to the new
pages random PageRanks chosen from those computed for
the previous crawl.
Historical-PageRank-Zero. New pages are all assigned a
zero PageRank and are thus crawled after “old” pages.
Historical-PageRank-Parent. Each new page is assigned
an out-link-normalized PageRank of its parent page(s)
linking to it. If a parent page is new as well (there is no
historical PageRank associated with it) we obviously
proceed to the grandparent and so forth.
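A small recursive sketch of the Historical-PageRank-Parent assignment. The dictionaries mapping URLs to their parents in the current crawl and to the previous crawl's PageRank and out-degree are assumed inputs, and the `seen` set merely guards against cycles among new pages.

```python
def historical_parent_rank(url, parents, old_rank, old_outdegree, seen=None):
    """Assign a new page the out-link-normalized historical PageRank of its
    parent page(s); if a parent is new as well, fall back to the grandparents."""
    seen = set() if seen is None else seen
    if url in seen:
        return 0.0
    seen.add(url)
    estimate = 0.0
    for parent in parents.get(url, []):
        if parent in old_rank:
            estimate += old_rank[parent] / max(old_outdegree.get(parent, 1), 1)
        else:
            # parent is new too: recurse to its own parents (the grandparents)
            estimate += historical_parent_rank(parent, parents, old_rank,
                                               old_outdegree, seen)
    return estimate
```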
Enhancing Web Data
Server log files provide domain types, time of access,
keywords and search engine used by visitors and can provide
some insight into how a visitor arrived at a website and what
keywords they used to locate it. Cookies dispensed from the
server can track browser visits and pages viewed and can
provide some insight into how often this visitor has been to the
site and what sections they wander into.
Mining Web Data
So far most analyses of web data have involved log
traffic reports, most of which provide cumulative accounts of
server activity but do not provide any true business insight
about customer demographics and online behavior. Most of the
current traffic analysis software, including NetIntellect, Bazaar
Analyzer Pro, HitList, NetTracker, Surf Report, WebTrends,
and others offer predefined reports about server activity based
on the analysis of log files. This basically limits the scope of
these tools to statistics about domain names, IP addresses,
cookies, browsers and other TCP/IP specific machine-to-
machine activity.
On the other hand, the mining of web data for an e-
commerce site yields visitor behavior analyses and profiles,
rather than server statistics. An e-commerce site needs to know
about the preferences and lifestyles of its visitors. Data mining
in this context is about addressing business questions such as: who is buying what items, and at what rates? You also would
like to know what is selling so you can adjust your inventory
and plan your orders and shipping. You need to know how to
sell and what incentives, offers and ads work, and how you
should design your site to optimize your profits.
Ten Steps to Mining Web Data
1. Plan Your Project: Identify Your Objective
The mining of a website involves some advance planning about what type and level of information you intend to capture at the server and what additional data you plan to match it with. This by itself will help ensure that the data mining efforts yield measurable business results. For example, you need to plan with the web team what kind of log, cookie and form information you intend to capture, and at what juncture, from the visitors.
2. Select Your Data
Once the business objective has been defined, we must select the web server and company data needed to meet this goal.
3. Prepare the Data
Once the data has been assembled and visually inspected,
we must decide which attributes to exclude and which
attributes need to be converted into usable formats.
4. Evaluate the Data
We should evaluate the data's structure to determine what
type of data mining tools to use for the analysis.
5. Format the Solution
There are a number of web mining formats or solutions. Having evaluated the web data and set the business objectives, we must select the format of the e-commerce solution.
6. Select the Tools
To choose the right mining tool, not only select the right technology but also consider the characteristics and structure of the data: the number of continuous-value fields, the number of dependent variables, the number of categorical fields, the length and type of records, and the "skewness" of the data set.
7. Construct the Models
It is not until this stage that we actually begin mining the web site files. During the mining process, we search for patterns in the data sets, generate classification rules, decision trees, clusters, scores and weights, and evaluate and compare error rates.
8. Validate the Findings
A data mining analysis of the web site will most likely
involve individuals from several departments, such as
information systems, marketing, sales, inventory, etc. It
most definitely will involve the administrators, designers, analysts and managers.
9. Deliver the Findings
A report should be prepared documenting the entire web mining process, including the steps taken in selecting and preparing the data, the tools used and why, the tool settings, the findings, and an explanation of any code that was generated.
10. Integrate the Solutions
This process involves incorporating the findings into your
firm's business practices, marketing efforts, and strategic
planning. Web mining is a pattern-recognition process
involving hundreds, thousands or maybe millions of daily
transactions in your web site.
Conclusion
Web crawling is more than a trivial graph traversal problem. It involves several issues that arise from the
distributed nature of the Web. First, Web crawlers must share
resources with other agents, mostly humans, and cannot monopolize Web sites' time; indeed, a Web crawler should try to minimize its impact on Web sites. Second, Web crawlers
have to deal with an information repository which contains
many objects of varying quality, including objects with very
low quality created to lure the Web crawler and deceive
ranking schemes.
While the model implies that all the portions of the
search engine should know all the properties of the Web pages,
the architecture introduced in this survey is an attempt to separate these properties into smaller units (text, link graph,
etc.) for better scalability.
Using Data mining solutions such as association,
segmentation, clustering, classification (prediction),
visualization and optimization and a data mining tool
incorporating a machine-learning algorithm, a website database
can be segmented into unique groups of visitors each with
individual behavior. Based on research studies and feedback provided by the data processing group, we gather a set of data that should be stored and some possible queries that other researchers may be interested in. We recommend a data schema to be used for storing Internet data, as well as a possible processing order for data loading.
References
1. Baeza-Yates R., Castillo C. Crawling the infinite Web: five
levels are enough. Proceedings
of the third Workshop on Web Graphs (WAW), Rome, Italy,
Lecture Notes in Computer Science, Springer, vol. 3243, pp.
156-167, 2004.
2. Baeza-Yates R., Castillo C., Marín M., Rodríguez A.
Crawling a country: better strategies than breadth-first for
web page ordering. Proceedings of the 14th international
conference on World Wide Web (WWW 2005), Chiba, Japan,
pp. 864-872, 2005.
3. Boldi P., Codenotti B., Santini M., Vigna S. UbiCrawler: a
scalable fully distributed Web crawler. Software Practice and
Experience, vol. 34, no. 8, pp.711-726, 2004.
4. Brin S., Page L. The Anatomy of a Large-Scale Hypertextual
Web Search Engine. Proceedings of the 7th World Wide Web
Conference, pp. 107-117, 1998.
5. Bröder A., Kumar R., Maghoul F., Raghavan P.,
Rajagopalan S., Stata R., Tomkins A., Wiener J. Graph
structure in the Web. Computer Networks, vol. 33, no. 1-6, pp. 309-320, 2000.
6. Chakrabarti S., Dom B. E., Gibson D., Kumar R., Raghavan
P., Rajagopalan S., Tomkins A. Spectral Filtering for Resource
Discovery. Proceedings of the ACM SIGIR Workshop on
Hypertext Information Retrieval on the Web, Melbourne,
Australia, pp. 13-21, 1998.
7. Chakrabarti S. Mining the Web: Analysis of Hypertext and
Semi Structured Data. Morgan Kaufmann Publishers, San
Francisco, California, USA, 2002.
8. Cho J, Shivakumar N., Garcia-Molina H. Finding Replicated
Web Collections. Proceedings of the 2000 ACM SIGMOD
International Conference on Management of Data, Dallas,
Texas, USA, pp. 355-366, 2000.
9. Cho J., Garcia-Molina H. Parallel crawlers. Proceedings of
the 11th international conference on the World Wide Web
(WWW '02), Honolulu, Hawaii, USA, pp. 124-135, 2002.