A Fast Community Based Algorithm for
Generating Web Crawler Seeds Set
Shervin Daneshpajouh, Motaba Mohammadi Nasiri
Computer Engineering Department, Sharif University of Technology, Azadi Avenue, Tehran, Iran
daneshpajouh@ce.sharif.edu, m_mohammadi@ce.sharif.edu
Mohammad Ghodsi
Computer Engineering Department, Sharif University of Technology, Azadi Avenue, Tehran, Iran
ghodsi@sharif.edu
Keywords: Crawling, Communities, Seed Quality Metric, Crawl Quality Metric, HITS, Web Graph,
Hyperlink Analysis.
Abstract: In this paper, we present a new and fast algorithm for generating the seeds set for web
crawlers. A typical crawler normally starts from a fixed set such as the DMOZ links, and then
continues crawling from URLs that are found in these web pages. Crawlers are supposed to
download more good pages in fewer iterations. Crawled pages are good if they have high
PageRanks and come from different communities. In this paper, we present a new algorithm with
running time O(n) for generating a crawler's seeds set based on the HITS algorithm. Starting from
a seeds set generated by our algorithm, a crawler can download qualified web pages from
different communities in fewer iterations.
1. Introduction
The web now has a major impact on our daily life by providing the information we need.
Based on [1], the size of the web was estimated at 11.5 billion pages in 2005. This size is
now even larger and keeps growing as time passes. Web search engines like Google, Yahoo,
and MSN play an important role in facilitating information access. A web search engine
consists of three main parts: a crawler that retrieves web pages, an indexer that builds
indexes, and a searcher. A major question a crawler has to face is which pages it should
retrieve so as to have the "most suitable" pages in the collection [3]. Crawlers normally
retrieve a limited number of pages. In this regard, the question is how fast a crawler can
collect the "most suitable" pages. A unique solution to this question is not likely to exist.
In what follows, we try to answer this question.
Different algorithms with different metrics have been suggested to lead a crawl towards
high quality pages [4, 5]. In [4], Cho, Garcia-Molina, and Page suggested using
connectivity-based metrics to do so. To direct a crawl, they used different ordering
metrics: breadth-first, backlink count, PageRank, and random. They showed that performing
a crawl in breadth-first order works reasonably well if "most suitable" pages are defined
to be pages with high PageRanks.
Najork and Wiener extended the result of Cho et al. They examined the average quality over
time of pages downloaded during a web crawl of 328 million unique pages. They showed that
traversing the web graph in breadth-first search order is a good crawling strategy.
Based on Henzinger's work [3], a better understanding of the graph structure might lead to
a more efficient way to crawl the web. We use this idea in this paper to develop our
algorithm. First, we define the "most suitable" pages and then we show how a crawler can
retrieve the most suitable pages. We use three metrics to measure the quality of a page.
The first metric is the community of a page: a good crawl should contain pages from
different communities. The second metric is its PageRank [2]: pages with high PageRanks
are the most important pages on the web. The third metric is the number of pages visited
per iteration: a good crawler will visit more pages in fewer iterations.
In this paper, we present a new fast algorithm for extracting a seeds set from previously
crawled pages. Using the metrics above, we show that by starting from the seeds suggested
by our algorithm, a crawler will quickly collect the most suitable pages from different
communities.
We have studied different community extraction algorithms: PageRank, Trawling, HITS, and
network-flow-based community discovery. From our analysis, we decided to use HITS ranking
without keyword search in our algorithm for community discovery and for collecting the
seeds set. We have found that bipartite cores are useful for selecting a seeds set.
Bipartite cores contain hub and authority pages. Since we are interested in having
authority pages in our crawl, we need to start crawling from hub pages. Hubs are durable
pages, so we can count on them for crawling.
The main idea in our method is to use HITS ranking on the whole graph for extracting the
most important bipartite cores. We offer two bipartite core extraction algorithms. Using
these algorithms, we extract bipartite cores and select some seeds from the hubs in the
extracted cores. Finally, we remove the extracted bipartite core from the graph and repeat
these steps until we have the desired number of seeds.
We have compared the results of crawls starting from the seeds set produced by our
algorithm with crawls starting from random nodes. Our experiments show that a crawl
starting from the seeds set identified by our algorithm finds the most suitable pages of
the web much faster than a random crawl does.
To the best of our knowledge, this is the first seeds extraction algorithm that is able to
identify and extract seeds from different web communities. A low running time is crucial
when working with large web data. The running time of the proposed algorithm is O(n). The
low running time together with the community-based properties makes this algorithm unique
in comparison with previous algorithms.
The remainder of this paper proceeds as follows. In Section 2, we present our algorithm
for discovering seed sets in a large web graph and compute the complexity of the proposed
algorithm. In Section 3, we discuss the results of running and evaluating this algorithm
on 18M and 39M node graphs. Section 4 contains conclusions and future work.
2. ALGORITHM for DISCOVERING
SEED SETS in LARGE WEB GRAPH
In this section, we present our algorithm for discovering seeds sets from a web graph.
First, we start with a discussion of the web structure; this will give the reader some
intuition about our algorithm. The web's macroscopic structure breaks into four pieces [6]:
SCC, IN, OUT, and TENDRILS. The SCC consists of pages all of which can reach one another
along directed links. IN consists of pages that can reach the SCC, but cannot be reached
from it. OUT consists of pages that are accessible from the SCC, but do not link back to
it. Finally, the TENDRILS contain pages that cannot reach the SCC and cannot be reached
from the SCC.
From the structure and analysis presented in [6], the most important web pages are
expected to be in SCC+OUT. From [2] we know that web pages with high PageRanks are the
most valuable pages on the web. In addition, from [3, 7, 8] we understand that bipartite
cores, usually called hubs and authorities, are among the most valuable sources on the
web. Besides, we know that the web contains thousands of different communities. In [7],
Kleinberg used keywords and a ranking method for finding hub and authority pages, whose
result contains pages in one or more communities. In [9], Flake et al. suggested a method
based on a network flow algorithm for finding web communities. The relation between the
two is that the community found by Kleinberg's method is expected to be a subset of the
community found by Flake et al.'s algorithm [10]. See Figure 1 for a sample relationship
between the results of these two algorithms.
A crawler normally does not crawl the entire web. Instead, it retrieves a limited number
of pages. Crawlers are expected to collect the "most suitable" pages of the web rapidly.
We defined the "most suitable" pages of the web as those pages with high PageRank. These
are the pages that the HITS algorithm calls authority pages. The difference is that the
HITS algorithm finds the authority pages related to given keywords, whereas PageRank shows
the importance of a page in the whole web. We also know that good hubs link to good
authorities. If we are able to extract good hubs from a web graph and from different
communities, we will be able to download good authorities with high PageRank from
different communities.
We use HITS ranking without keyword search on previously crawled web pages and then prune
the resulting subgraph. The procedure of HITS ranking followed by elimination is repeated
until enough seed points are obtained. We show that hub nodes selected from the subgraphs
resulting from these iterations can solve the crawl problem mentioned in this paper.
Figure 1. Relation between a community extracted by network flow and a HITS bipartite core.
2.1. Iterative HITS Ranking & Pruning
We assume that we have a web graph of crawled web pages. The goal is to extract a seeds
set from this graph so that a crawler can collect the most important pages of the web in
fewer iterations. To do this, we run the HITS ranking algorithm on this graph. This is the
second step of the HITS algorithm; in the first step, HITS searches for keywords in an
index-based search engine. For our purpose, we ignore this step and only run the ranking
step on the whole graph. In this way, bipartite cores with high hub and authority ranks
become visible in the graph. Then we select the most highly ranked bipartite core using
one of two algorithms we suggest, namely extracting seeds with fixed size and extracting
seeds with fixed density. Then, we remove this sub-graph from the graph, and repeat the
ranking, seed extraction, and sub-graph removal steps until we have enough seeds.
A question that may arise is why, when repeating these steps, we need to run HITS ranking
again. Is one ranking not enough for all the steps? The answer is that removing a bipartite
core in each step modifies the web graph structure we are working on. In fact, re-ranking
changes the hub and authority ranks of the bipartite cores. Removing the highest-ranked
bipartite core and re-ranking the web graph drives the newly appearing bipartite cores to
come from different communities. Thus, a crawler will be able to download pages from
different communities starting from these seeds. We have experimented with our algorithm
using the web graph of UK-2002, containing 18M nodes and 298M edges, and UK-2005,
containing 39M nodes and 936M edges [11]. Our experiments show that the extracted
bipartite cores have a reasonable distance from each other.
Another question that may arise is why, if a crawler starts from the seeds produced by our
algorithm, the crawl would lead to the most suitable pages. The answer is that, in the
iterations of the algorithm, we select and extract high-ranked bipartite cores from the
web graph. Extracted bipartite cores have high hub or authority ranks, and pages with high
hub rank are expected to link to pages with high PageRank. Our experiments support this
hypothesis.
2.2. HITS Ranking
Ranking is the second step of the HITS algorithm [7]. In our algorithm, the HITS-Ranking
procedure is given a directed graph G and HITSIterationCount as input. The procedure
defines two vectors, h and a, for the hub and authority ranks of the nodes; their initial
values are set to 1. The algorithm then updates the h and a values of all nodes in the
input graph. Afterwards, the updated vectors are normalized to 1. The algorithm repeats
the vector update and normalization steps until it reaches the required number of
iterations. The two vectors h and a are returned as the result of the HITS-Ranking
procedure. Figure 2 shows the HITS-Ranking algorithm.
2.3. Extracting Seeds with Fixed Size
The Extract-Bipartite-Cores-with-Fixed-Size procedure, as its name indicates, extracts one
bipartite sub-graph with the highest hub and authority ranks and with a predetermined size
given as an input. The algorithm is given a directed graph G, BipartiteCoreSize,
NewMemberCount, and the h and a vectors. BipartiteCoreSize specifies the desired size of
the bipartite core we would like to extract. NewMemberCount indicates how many hub or
authority nodes should be added to the hub or authority sets in each iteration of the
algorithm. The h and a vectors are the hub and authority ranks of the nodes in the input
graph G. In the initial steps, the algorithm sets HubSet to the empty set and adds the
node with the highest authority rank to AuthoritySet. While the sum of the AuthoritySet
size and the HubSet size is less than BipartiteCoreSize, it continues to find new hubs and
authorities, NewMemberCount at a time, and adds them to the related sets.
Procedure HITS-Ranking
Input: graph: G=(V,E) , integer:HITSIterationCount
1) For all v in V do
2) Set a(v) = 1;
3) Set h(v) = 1;
4) End For
5) For i=1 to HITSIterationCount do
6) For all v in V do
7) a(v) = Σ_{(w,v) ∈ E} h(w);
8) End For
9) For all v in V do
10) h(v) = Σ_{(v,w) ∈ E} a(w);
11) End For
12) Normalize h vector to 1;
13) Normalize a vector to 1;
14) End For
output: h, a
End Procedure
Figure 2. HITS Ranking Algorithm
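As a concrete illustration (not the authors' implementation), the HITS-Ranking procedure of Figure 2 can be sketched in Python as follows. The sketch assumes the graph is stored as a dictionary succ that maps every node, including pure link targets, to the list of nodes it links to, and it normalizes with the L2 norm; the paper does not specify which norm is used.

    import math

    def hits_ranking(succ, iteration_count=60):
        # succ: node -> list of successor nodes; every node must appear as a key.
        nodes = list(succ.keys())
        a = {v: 1.0 for v in nodes}   # authority ranks, initialized to 1
        h = {v: 1.0 for v in nodes}   # hub ranks, initialized to 1
        for _ in range(iteration_count):
            # authority update: a(v) = sum of h(w) over all edges (w, v)
            new_a = {v: 0.0 for v in nodes}
            for w in nodes:
                for v in succ[w]:
                    new_a[v] += h[w]
            a = new_a
            # hub update: h(v) = sum of a(w) over all edges (v, w)
            h = {v: sum(a[w] for w in succ[v]) for v in nodes}
            # normalize both vectors (L2 norm assumed)
            norm_a = math.sqrt(sum(x * x for x in a.values())) or 1.0
            norm_h = math.sqrt(sum(x * x for x in h.values())) or 1.0
            a = {v: x / norm_a for v, x in a.items()}
            h = {v: x / norm_h for v, x in h.items()}
        return h, a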
Figure 3. Extracting Bipartite Cores with Fixed Size
Procedure Extract-Bipartite-Cores-with-Fixed-Size
Input: graph: G=(V,E) , integer: BipartiteCoreSize,
NewMemberCount;
vector: h,a.
1) HubSet = ∅;
2) AuthoritySet = { v with the highest a(v) };
3) While |AuthoritySet| + |HubSet| < BipartiteCoreSize do
4) HubSet = HubSet ∪ (Find Top NewMemberCount nodes v by h(v) where (v,w) ∈ E
and w ∈ AuthoritySet and v ∉ HubSet);
5) AuthoritySet = AuthoritySet ∪ (Find Top NewMemberCount nodes w by a(w) where (v,w) ∈ E
and v ∈ HubSet and w ∉ AuthoritySet);
6) End While
output: HubSet, AuthoritySet
End Procedure
We use this procedure when we want to extract a bipartite sub-graph of fixed size.
Figure 3 shows the details of the Extract-Bipartite-Cores-with-Fixed-Size procedure. In
Figure 4 we show the steps of bipartite sub-graph creation with NewMemberCount equal to 1.
An interesting observation from our experiments is that, in the very first steps, all the
hubs have links to all the authorities. This led us to suggest an extraction algorithm
with a density factor, described in the following subsection.
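A minimal Python sketch of the fixed-size extraction, under the same assumptions as above plus a reverse adjacency dictionary pred (node to the list of nodes linking to it), might look as follows; it is an illustration of the procedure in Figure 3, not the authors' code.

    def extract_core_fixed_size(succ, pred, h, a, core_size, new_member_count):
        hub_set = set()
        authority_set = {max(a, key=a.get)}          # start from the top authority
        while len(hub_set) + len(authority_set) < core_size:
            # candidate hubs: nodes linking into the current authority set
            hub_cands = {v for w in authority_set for v in pred[w]} - hub_set
            hub_set |= set(sorted(hub_cands, key=lambda v: h[v],
                                  reverse=True)[:new_member_count])
            # candidate authorities: nodes linked to by the current hub set
            auth_cands = {w for v in hub_set for w in succ[v]} - authority_set
            authority_set |= set(sorted(auth_cands, key=lambda w: a[w],
                                        reverse=True)[:new_member_count])
            if not hub_cands and not auth_cands:
                break                                # the core cannot grow further
        return hub_set, authority_set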
2.4. Extracting Seeds with Fixed Cover
Density
The Extract-Bipartite-Cores-with-Fixed-CoverDensity procedure, as its name indicates,
extracts one bipartite sub-graph with the highest hub and authority ranks such that the
sub-graph satisfies the desired cover-density value. A directed graph G, CoverDensity, and
the h and a vectors are given to the algorithm.
We define Cover-Density as follows:
Cover-Density = 100 × |E(HubSet, AuthoritySet)| / (|HubSet| × |AuthoritySet|)     (1)
where E(HubSet, AuthoritySet) is the set of edges from nodes in HubSet to nodes in AuthoritySet.
This measure shows how densely the nodes in the authority set are covered by the nodes in
the hub set. If the bipartite sub-graph is a complete bipartite sub-graph, this measure
will be equal to 100. Therefore, if we intend to extract complete bipartite sub-graphs, we
set CoverDensity to 100. The h and a vectors are the hub and authority ranks of the nodes
in the input graph G.
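Under the same adjacency-dictionary assumption, the Cover-Density of a candidate pair of hub and authority sets can be computed with the following small sketch; for a complete bipartite core it returns 100.

    def cover_density(succ, hub_set, authority_set):
        # Equation (1): percentage of possible hub-to-authority links that exist.
        # Both sets are assumed to be non-empty.
        links = sum(1 for v in hub_set for w in succ[v] if w in authority_set)
        return 100.0 * links / (len(hub_set) * len(authority_set))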
In the initial steps, the algorithm sets HubSet to the empty set and adds the node with
the highest authority rank to AuthoritySet. In addition, it sets CoverDensityCur to 100.
While CoverDensityCur is greater than or equal to the input CoverDensity, the procedure
continues to find new hubs and authorities. This algorithm adds only one new node to each
set in every iteration. Recall that in Extract-Bipartite-Cores-with-Fixed-Size we could
adjust the number of new members; here we do not have such a variable, because we want a
precise cover density. In other words, increasing the number of new nodes beyond 1 might
reduce the accuracy of the desired cover density. We use this procedure when we want to
extract a bipartite sub-graph with a desired link density between hubs and authorities.
Figure 5 shows the details of the Extract-Bipartite-Cores-with-Fixed-CoverDensity
procedure.
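A sketch of this procedure, reusing the cover_density helper above and the succ/pred dictionaries assumed earlier (again an illustration, not the paper's implementation), is given below; one hub and at most one authority are added per iteration, mirroring Figure 5.

    def extract_core_fixed_density(succ, pred, h, a, cover_density_min=100.0):
        hub_set = set()
        authority_set = {max(a, key=a.get)}          # start from the top authority
        current = 100.0
        while current >= cover_density_min:
            # one new hub: highest-ranked node linking into the authority set
            hub_cands = {v for w in authority_set for v in pred[w]} - hub_set
            if not hub_cands:
                break
            hub_set.add(max(hub_cands, key=lambda v: h[v]))
            # one new authority: highest-ranked node linked to by the hub set
            auth_cands = {w for v in hub_set for w in succ[v]} - authority_set
            if auth_cands:
                authority_set.add(max(auth_cands, key=lambda w: a[w]))
            current = cover_density(succ, hub_set, authority_set)
        return hub_set, authority_set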
2.5. Putting It All Together
So far, we have presented algorithms for HITS ranking and for bipartite core extraction
based on hub and authority ranks. Our goal is to extract a desired number of seeds so that
a crawl can download pages with high PageRank from different web communities in fewer
iterations. We use the proposed algorithms to achieve this goal. We assume that we have a
web graph of crawled web pages. We run the HITS-Ranking algorithm on the whole graph and
use one of the bipartite core extraction algorithms we have presented. Then we select one
of the nodes in the extracted hub set arbitrarily and add it to our seeds set. Finally, we
remove the extracted core from the input graph and repeat these steps until we have the
desired number of seeds.
We can use either of the two bipartite-core extraction algorithms we have proposed:
Extract-Bipartite-Cores-with-Fixed-Size and Extract-Bipartite-Cores-with-Fixed-CoverDensity.
If we want bipartite cores of a fixed size, we use the first algorithm; if we are looking
for bipartite cores with a desired cover density, we use the second algorithm. For example,
if we want the bipartite cores to be complete, we must use the second algorithm. We have
experimented with both algorithms. Since we cannot guess a suitable size for a web
community, we use the second method, which controls the density of links between hubs and
authorities. If we have a complete bipartite core, we are sure that all the authority
pages are from the same community; by decreasing the Cover-Density measure, we decrease
the degree of relationship between the authority pages. Because the second method is more
reliable than the first one, in this paper we only present experimental results obtained
using Extract-Bipartite-Cores-with-Fixed-CoverDensity.
Figure 4. Steps of bipartite sub-graph creation with NewMemberCount equal to 1. (a) The
sub-graph after adding the node with the highest authority rank and the hub with the
highest rank that refers to this authority node. (b) The next authority with the highest
rank that is not yet in the authority set and is linked to by the only node in the hub
set. (c) The second hub node with the highest hub rank that was not already in the hub set
and links to one of the nodes in the authority set. (d) The resulting sub-graph after 4
steps.
Figure 6 shows the seed extraction algorithm we have used in our experiments in this paper.
The Extract-Seeds algorithm receives a directed graph G and SeedCount as input. In the
initial step, the algorithm sets SeedSet to the empty set. While the size of SeedSet is
less than SeedCount, the algorithm keeps running. In the first line of the while loop, the
algorithm calls the HITS-Ranking procedure with G as the input graph and 60 as
HITSIterationCount. Kleinberg's work shows that a HITSIterationCount of 20 is enough for
convergence of hub and authority ranks in a small sub-graph [7]. We have found
experimentally that a value above 50 is enough for convergence of hub and authority ranks
on the data sets we use. The HITS-Ranking algorithm returns two vectors, h and a,
containing the hub and authority ranks of all nodes in graph G. In the next line, the
algorithm calls Extract-Bipartite-Cores-with-Fixed-CoverDensity with G as the input graph,
100 as the cover density value, and h and a as the hub and authority vectors. This
function finds a complete bipartite core in the input graph and returns its nodes in
HubSet and AuthoritySet. In the next line, a node is selected at random from the hub set
and added to SeedSet. The algorithm then removes the hub and authority nodes and their
edges from the graph G. The removal step helps us find seeds from different communities.
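Putting the sketches together, the Extract-Seeds loop of Figure 6 could be approximated as follows; the node identifiers are assumed to be sortable (e.g. integers), and the removal is done by filtering the adjacency dictionaries. This is a simplified illustration of the paper's procedure, not its actual implementation.

    import random

    def extract_seeds(succ, pred, seed_count, hits_iterations=60):
        seeds = set()
        while len(seeds) < seed_count and succ:
            h, a = hits_ranking(succ, hits_iterations)
            hub_set, auth_set = extract_core_fixed_density(succ, pred, h, a, 100.0)
            if not hub_set:
                break                                  # no core could be grown
            seeds.add(random.choice(sorted(hub_set)))  # one arbitrary hub as a seed
            # remove the extracted core and all of its edges from the graph
            removed = hub_set | auth_set
            for v in removed:
                succ.pop(v, None)
                pred.pop(v, None)
            for v in succ:
                succ[v] = [w for w in succ[v] if w not in removed]
            for v in pred:
                pred[v] = [w for w in pred[v] if w not in removed]
        return seeds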
2.6. Complexity of Proposed Seeds
Extraction Algorithm
The running time of the seeds extraction algorithm (Figure 6) is O(n), where n is the
number of nodes in the input graph. The while loop of lines 2-12 is executed at most
|SeedCount| times. The HITS-Ranking call in line 3 is done in O(n), because the complexity
of HITS-Ranking (Figure 2) is Θ(2·K·L·n), where K is HITSIterationCount and L is the
average number of neighbors of a node. The factor 2 appears because there are two update
steps, one for the hub vector and one for the authority vector. In addition, the
normalization steps can be done in Θ(3n). So the complexity of HITS-Ranking is O(n).
The running time of Extract-Bipartite-Cores-with-Fixed-CoverDensity in line 4 is O(n). The
while loop of lines 4-8 in Figure 5 is executed at most |HubSet| + |AuthoritySet| times,
which can be viewed as a constant number k. Finding and adding a distinct hub node with
the highest rank to the hub set, in line 5, takes Θ(k·n). Finding and adding a distinct
authority node with the highest rank to the authority set, in line 6, takes Θ(k·n). So the
running time of Extract-Bipartite-Cores-with-Fixed-CoverDensity is at most O(n).
The removal steps of lines 6-11 in Figure 6 take O(n) for removing the identified hubs and
authorities. Therefore, the total running time of the seeds extraction algorithm is
O(|SeedCount|·n), which is equal to O(n) since SeedCount is a constant.
3. EXPERIMENTAL RESULTS
In this section, we apply the proposed algorithm to find a seeds set from previously
crawled pages. We then start a crawl from the extracted seeds on the same graph to
evaluate the result. To show how applying the algorithm to old data can provide good seeds
for a new crawl, we also start a crawl on a newer graph using a seeds set extracted from a
previous crawl.
3.1. Data Sets
The Laboratory for Web Algorithmics at the University of Milan provides several web graph
data sets [11]. In our experiments, we have used the UK-2002 and UK-2005 web graph data
sets provided by this laboratory. These data sets are compressed using the WebGraph
library. WebGraph is a framework for studying the web graph [12]. It provides simple ways
to manage very large graphs, exploiting modern compression techniques. With WebGraph, we
can access and analyze a very large web graph on a PC.
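WebGraph itself is a Java library; the Python sketches in this paper's description instead assume a plain adjacency-dictionary representation. One hypothetical way to obtain it is to export the compressed graph to a whitespace-separated "source target" edge list first and load it as below; the file format and helper are assumptions, not part of the WebGraph API.

    def load_edge_list(path):
        # Builds the succ/pred dictionaries used by the sketches above from an
        # edge-list file with one "source target" pair of integer ids per line.
        succ, pred = {}, {}
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                src, dst = map(int, line.split())
                succ.setdefault(src, []).append(dst)
                pred.setdefault(dst, []).append(src)
                succ.setdefault(dst, [])   # ensure every node appears as a key
                pred.setdefault(src, [])
        return succ, pred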
Figure 6. Seeds Extraction Algorithm
Procedure Extract-Seeds
Input: graph: G=(V,E) , integer: SeedCount;
1) SeedSet = ∅;
2) While |SeedSet| < SeedCount do
3) h, a = HITS-Ranking( G , 60);
4) HubSet, AuthoritySet = Extract-Bipartite-Cores-with-Fixed-CoverDensity(G, 100, h, a);
5) SeedSet = SeedSet ∪ {a node selected arbitrarily from HubSet};
6) For all v in HubSet do
7) Remove v and all E(v) from G;
8) End For
9) For all v in AuthoritySet do
10) Remove v and all E(v) from G;
11) End For
12) End While
output: SeedSet
End Procedure
Figure 5. Extracting Bipartite Cores with Fixed Density
Procedure Extract-Bipartite-Cores-with-Fixed-CoverDensity
Input: graph: G=(V,E) , integer: CoverDensity;
vector: h,a.
1) HubSet = ∅;
2) AuthoritySet = { v with the highest a(v) };
3) CoverDensityCur = 100;
4) While CoverDensityCur ≥ CoverDensity do
5) HubSet = HubSet ∪ {the node v with the highest h(v) where (v,w) ∈ E
and w ∈ AuthoritySet and v ∉ HubSet};
6) AuthoritySet = AuthoritySet ∪ {the node w with the highest a(w) where (v,w) ∈ E
and v ∈ HubSet and w ∉ AuthoritySet};
7) CoverDensityCur = 100 × |E(HubSet, AuthoritySet)| / (|HubSet| × |AuthoritySet|);
8) End While
output: HubSet, AuthoritySet
End Procedure
Table 1. UK-2002 and UK-2005 data set information before pruning
Data Set    Nodes        Edges          Diameter Estimate
UK-2002     18,520,486   298,113,762    14.9
UK-2005     39,459,935   936,364,282    15.7

Table 2. UK-2002 and UK-2005 data set information after pruning
Data Set    Nodes        Edges
UK-2002     18,520,486   22,720,534
UK-2005     39,459,935   183,874,700
Figure 7. Log-log in-degree distribution of UK-2002.
Figure 8. Log-log out-degree distribution of UK-2002.
Figure 9. Log-log in-degree distribution of UK-2005.
Figure 10. Log-log out-degree distribution of UK-2005.
Figure 11. Log-log diagram of hub and authority set sizes extracted from UK-2002 in different iterations.
Figure 12. Log-log diagram of hub and authority set sizes extracted from UK-2005 in different iterations.
3.1.1. UK-2002
This data set was obtained from a 2002 crawl of the .uk domain performed by UbiCrawler
[13]. The graph contains 18,520,486 nodes and 298,113,762 links.
3.1.2. UK-2005
This data set was obtained from a 2005 crawl of the .uk domain performed by UbiCrawler.
The crawl was very shallow, and aimed at gathering a large number of hosts but a small
number of pages from each host. This graph contains 39,459,935 nodes and 936,364,282 links.
3.2. Data Set Characteristics
3.2.1. Degree Distribution
We investigated the degree distributions of UK-2002 and UK-2005. Figures 7 and 8 show the
in-degree and out-degree distributions for UK-2002 in log-log form. Figures 9 and 10 show
the in-degree and out-degree distributions for UK-2005 in log-log form. The results show
that the in-degree and out-degree distributions follow power laws in these two data sets.
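The degree-distribution check can be reproduced with a short sketch like the one below, which counts how many nodes have each in-degree and returns the points of the log-log plot; a power law shows up as a roughly straight line. This is illustrative only and not the analysis code used for the figures.

    import math
    from collections import Counter

    def in_degree_distribution(pred):
        # pred: node -> list of nodes linking to it.
        counts = Counter(len(in_links) for in_links in pred.values())
        # (log in-degree, log node count) pairs; +1 avoids log(0) for degree 0.
        return sorted((math.log(d + 1), math.log(c)) for d, c in counts.items())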
3.2.2. Diameter
The diameter of a web graph is defined as the length of the shortest path from u to v,
averaged over all ordered pairs (u, v) [14]. Of course, we omit pairs for which there is
no path between them (infinite distance). This is called the average connected distance in
[6]. We estimated this measure on the UK-2002 and UK-2005 data sets through experiments.
Table 1 shows the estimated diameter of these data sets together with the number of nodes
and edges. We use the resulting diameter to evaluate the distances between the bipartite
cores extracted by our method.
3.3. Data Preparation
3.3.1. Pruning
Most of the links between pages within a site are for navigational purposes. These links
may distort the result of the presented algorithm: running HITS-Ranking on the un-pruned
graph tends to find hub and authority pages inside a single site. To eliminate this effect
we remove all links between pages in the same site, assuming that pages with the same host
name are in the same site. Table 2 shows the number of nodes and edges in the UK data sets
after pruning.
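The pruning step can be sketched as follows; the hostname helper that maps a node id to its host name is an assumption about how the URL table accompanying the graph is exposed.

    def prune_intra_site_links(succ, hostname):
        # Drop every link whose source and target share the same host name.
        pruned = {}
        for v, out_links in succ.items():
            pruned[v] = [w for w in out_links if hostname(w) != hostname(v)]
        return pruned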
Figure 13. Graphical representation of distances between 56 extracted seeds from UK-2002 by our algorithm. The number beside each node indicates the iteration in which the related node was extracted.
Figure 14. Comparison of the PageRank of crawled pages starting from 10 seeds extracted by our method on UK-2002 and from 10 random seeds.
Figure 15. Comparison of the log-count of pages visited at each iteration starting from 10 seeds extracted by our method and from 10 seeds selected randomly from UK-2002.
3.4. Results of Extracted Seeds using
Proposed Algorithm
We ran our seeds extraction algorithm, Extract-Seeds, on UK-2002 and UK-2005. This
algorithm, as Figure 6 shows, sets CoverDensity to 100 for seed extraction. It searches
for and extracts complete bipartite cores and then, at each step, selects a seed from the
hub nodes of the bipartite sub-graph (see Figure 4). Figure 11 shows the sizes of the
extracted hub and authority sets in different iterations for UK-2002; these cores are
complete bipartite. To reduce the impact of outlier hub sizes in the graphical
presentation, we have used a log-log diagram. Figure 12 depicts the sizes of the extracted
hub and authority sets in different iterations for UK-2005. Normally the hub sets are
larger than the authority sets. We obtained bipartite cores with very large hub sets in
UK-2002, so we have limited the number of hubs to 999 in the UK-2002 data set.
3.5. Quality Analysis
3.5.1. Metrics for Analysis
We used several metrics to evaluate the quality of the extracted seeds. The first metric
is the distance between the extracted seeds. As mentioned earlier, a crawler should
retrieve web pages from different communities. Using HITS ranking and iterative pruning,
we expect the extracted seeds to come from different communities. To confirm this
intuition, we measure the distances between the extracted cores. We define the core
distance as the length of the shortest directed path from one of the nodes in the source
core to one of the nodes in the destination core.
The second metric is the PageRank of the pages that will be crawled starting from these
seeds. We have defined the most suitable pages on the web to be the pages with high
PageRanks. Therefore, if the average PageRank of the crawled pages at each step of the
crawl is higher than that of a random crawl, especially at the beginning, then we can
conclude that a crawl starting from the seeds identified by our algorithm results in
better pages.
The third metric is the number of crawled pages at each step of crawling. We focus on a
crawler whose goal is to download good pages in few iterations. Thus, if the number of
pages crawled at each step starting from seeds extracted by our method is larger than the
number of pages crawled starting from a random set, then we can conclude that our method
also leads a crawl toward visiting more pages in fewer iterations.
For the first metric, we measure the distance between the cores. For the other two
metrics, we need to crawl the graph starting from seeds extracted with our method and
compare it with a crawl starting from randomly selected seeds.
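The core-distance metric can be computed with a multi-source BFS from the source core over the directed links, as in the sketch below (an illustration under the adjacency-dictionary assumption used earlier).

    from collections import deque

    def core_distance(succ, source_core, dest_core):
        # Shortest directed path length from any node of source_core
        # to any node of dest_core.
        dest = set(dest_core)
        dist = {v: 0 for v in source_core}
        queue = deque(source_core)
        while queue:
            v = queue.popleft()
            if v in dest:
                return dist[v]
            for w in succ.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return float('inf')   # no directed path between the cores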
3.5.2. Result of Bipartite Core Distances
We have measured the distances between all bipartite cores extracted from the UK data
sets, and they have a reasonable distance in comparison with the diameter of the related
graph. Figure 13 shows the graphical representation of the distances between 54 cores
extracted from UK-2002. The number at the top of each node indicates the iteration in
which the core was extracted. Because the distance graph between nodes may not have a
Euclidean representation, the distances in this figure do not exactly match the real
distances. Another important observation is that bipartite cores from nearby iterations
have a distance equal to or bigger than the average distance, while cores that are close
to each other (connected by a short directed path) are identified in far-apart iterations.
As an example, the core extracted in iteration 32 has a distance of one to the core
extracted in iteration 47. In this sample, the minimum distance between nodes is 1, the
maximum distance is 13, and the average distance is 7.15. As the estimated diameter of the
UK-2002 data set is 14.9, these core distances are reasonable.
3.5.3. Result of Average PageRank and Visit
Count
In this section, we evaluate the second and third metrics defined above. For UK-2002, we
executed the Extract-Seeds algorithm with SeedCount = 10; therefore, the algorithm
extracts one seed from each core in each iteration.
Figure 16. Comparison of the PageRank of crawled pages starting from 10 seeds extracted from UK-2005 by our method and from 10 random seeds selected from UK-2005.
Figure 17. Comparison of the log-count of pages visited at each iteration starting from 10 seeds extracted by our method from UK-2005 and from 10 seeds selected randomly on UK-2005.
Then, we started a crawl on the UK-2002 data set using a BFS strategy and measured the
average PageRank of the pages visited at each crawl depth, as well as the number of pages
visited at each depth. We then compared the results with those of a crawl starting from
random seeds on the same graph.
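The evaluation crawl can be simulated with a BFS over the graph that records, at each depth, the number of newly visited pages and their mean PageRank, as sketched below; the precomputed pagerank dictionary is an assumption, and this is not the evaluation code used for the reported figures.

    def bfs_evaluate(succ, seeds, pagerank, max_depth=20):
        visited = set(seeds)
        frontier = list(seeds)
        stats = []                        # (depth, pages at depth, mean PageRank)
        for depth in range(max_depth):
            if not frontier:
                break
            mean_pr = sum(pagerank[v] for v in frontier) / len(frontier)
            stats.append((depth, len(frontier), mean_pr))
            next_frontier = []
            for v in frontier:
                for w in succ.get(v, []):
                    if w not in visited:
                        visited.add(w)
                        next_frontier.append(w)
            frontier = next_frontier
        return stats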
Figure 14 compares the average PageRank of the crawl starting from seeds extracted with
our method with that of a crawl starting from random seeds. Except for the first depth
(iteration) of the crawl, in the other steps up to step 4 the average PageRank of pages
crawled with our method is better; especially in the second and third iterations, the
difference is substantial. In the later iterations the average PageRank of the visited
pages is close for both methods.
Figure 15 compares the log-number of pages visited at each depth of the crawl on UK-2002.
For better graphical representation, we have plotted the log-count of visited pages. The
results of our method are always better than those of the crawl starting from random
seeds, and a crawl with seeds extracted by our method downloads more pages in fewer
iterations. Figures 16 and 17 show the same experiments on UK-2005; the same results
appear there too.
3.5.4. Good Seeds for a New Crawl
Using the proposed algorithm we have discovered seeds from UK-2002 and UK-2005 and
evaluated their quality using the three evaluation criteria. These evaluations are
encouraging, but a real crawler does not have access to the web graph it is going to
crawl. We should therefore show that the result is still good if we start the crawl using
seeds extracted from an older crawled graph.
In this section, we show the result of crawling UK-2005 using seeds extracted from UK-2002
by the proposed algorithm, and we compare it with randomly selected seeds to simulate a
real environment. Before running the crawl, we checked the validity of the seeds found in
UK-2002 against the UK-2005 data set: if a seed does not exist in the newer graph, we
remove it from the seeds set. Our experiments show that only 11 percent of the seeds exist
in the new data set. In fact, we extracted 100 seeds from UK-2002 to be sure of having 11
valid seeds in UK-2005.
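The validity check can be expressed as a one-line filter once the seeds are identified by URL; the url_of_old mapping and the set of URLs in the newer graph are assumptions about how the two data sets are joined.

    def valid_seeds(old_seeds, url_of_old, urls_in_new_graph):
        # Keep only the old seeds whose URL still exists in the newer graph.
        return {s for s in old_seeds if url_of_old[s] in urls_in_new_graph}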
Figure 18 compares the average PageRank of the crawl starting from seeds extracted with
our method with that of a crawl starting from random seeds. The result of our method is
better until iteration 3. Figure 19 compares the log-number of pages visited at each depth
of the crawl; in this case, the result of our method is better than the random case
between steps 4 and 15. In other words, our method downloads pages with high PageRank
until iteration 3, and after that it crawls more pages than the random case until
iteration 15; after that the results are nearly the same. Therefore, we can conclude that
a crawler starting from a seeds set generated by our algorithm can download qualified web
pages in fewer iterations.
4. CONCLUSION and FUTURE WORK
Crawlers aim to download more good pages in few iterations. In this paper, we have
presented a new fast algorithm with running time O(n) for extracting a seeds set from
previously crawled web pages. Our experiments show that if a crawler starts crawling from
the seeds set identified by our method, it will crawl more pages with higher PageRank,
from different communities, in fewer iterations than when starting from a random seeds
set. In addition, we have measured the distance between the selected seeds to make sure
that our seeds set contains nodes from different communities. To the best of our
knowledge, this is the first seeds extraction algorithm that is able to identify and
extract seeds from different communities.
Our experiments were performed on graphs containing at most 39M nodes and 183M edges. This
method can be evaluated on larger graphs in order to investigate the quality of its
results there too.
Figure 18. Comparison of the PageRank of crawled pages starting from 11 seeds extracted from UK-2002 by our method and from 11 random seeds selected from UK-2005.
Figure 19. Comparison of the log-count of pages visited at each iteration starting from 11 seeds extracted by our method from UK-2002 and from 11 seeds selected randomly on UK-2005.
Another aspect where improvement may be possible is the handling of seeds that are not
found in a new crawl. In our experiments, we simply ignored nodes that were present in the
older graph but not in the newer one. This could be improved by finding similar nodes in
the newer graph.
5. REFERENCES
[1] Gulli, A., and Signorini, A. The Indexable Web is More
than 11.5 billion pages. WWW (Special interest and tracks
and posters), (May. 2005), 902-903.
[2] Brin, S. and Page, L. The anatomy of a large-scale
hypertextual Web search engine. Proceedings of the
seventh international conference on World Wide Web 7,
Brisbane, Australia, 1998, 107 – 117.
[3] Henzinger, M. R. Algorithmic challenges in Web Search
Engines. Internet Mathematics, Volume 1, Number 1,
2003, 115-123.
[4] Cho,J. Garcia-Molina, H. and Page, L. Efficient Crawling
through URL ordering. In Proceedings of the 7th
International World Wide Web Conference, pages 161-
172, Brisbane, Australia, April 1998. Elsevier Science
[5] Najork, M. and Wiener, J. L. Breadth-First Search Crawling
Yields High-Quality Pages, Proceedings of the 10th
international conference on World Wide Web WWW '01,
2001.
[6] Andrei Z. Broder, Ravi Kumar, Farzin Maghoul, Prabhakar
Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew
Tomkins, Janet L. Wiener: Graph structure in the Web.
Computer Networks 33(1-6): 309-320 (2000)
[7] Jon M. Kleinberg: Authoritative Sources in a Hyperlinked
Environment. J. ACM 46(5): 604-632 (1999)
[8] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan,
Andrew Tomkins: Trawling the Web for Emerging Cyber-
Communities. Computer Networks 31(11-16): 1481-1493
(1999)
[9] Gary William Flake, Steve Lawrence, C. Lee Giles, Frans
Coetzee: Self-Organization and Identification of Web
Communities. IEEE Computer 35(3): 66-71 (2002)
[10] J. Kleinberg, S. Lawrence. The Structure of the Web.
Science 294(2001), 1849.
[11] Laboratory for Web Algorithmics, http://law.dsi.unimi.it/
[12] Paolo Boldi and Sebastiano Vigna, The WebGraph
framework I: Compression techniques. In Proc. of the
Thirteenth International World Wide Web Conference
(WWW 2004), pages 595-601, Manhattan, USA, 2004.
ACM Press.
[13] Boldi, P., Codenotti, B., Santini, M., Vigna, S.,
UbiCrawler: A Scalable Fully Distributed Web Crawler,
Journal of Software: Practice & Experience, 2004, volume
34, number 8, pages 711—726.
[14] Albert, R., Jeong, H., and Barabasi, A. L. A Random Graph Model for Massive Graphs,
ACM Symposium on Theory of Computing, 2000.