A Fast Community Based Algorithm for
Generating Web Crawler Seeds Set
Shervin Daneshpajouh, Motaba Mohammadi Nasiri
Computer Engineering Department, Sharif University of Technology, Azadi Avenue, Tehran, Iran
daneshpajouh@ce.sharif.edu, m_mohammadi@ce.sharif.edu
Mohammad Ghodsi
Computer Engineering Department, Sharif University of Technology, Azadi Avenue, Tehran, Iran
ghodsi@sharif.edu
Keywords: Crawling, Communities, Seed Quality Metric, Crawl Quality Metric, HITS, Web Graph,
Hyperlink Analysis.
Abstract: In this paper, we present a new and fast algorithm for generating the seeds set for web
crawlers. A typical crawler normally starts from a fixed set such as the DMOZ links, and then
continues crawling from URLs that are found in these web pages. Crawlers are supposed to
download more good pages in fewer iterations. Crawled pages are good if they have high
PageRanks and come from different communities. In this paper, we present a new algorithm with
running time O(n) for generating a crawler's seeds set based on the HITS algorithm. Starting from
a seeds set generated by our algorithm, a crawler can download qualified web pages from
different communities in fewer iterations.
1. Introduction
The web now has a major impact on our daily life by providing the information we need.
Based on [1], the size of the web was estimated at 11.5 billion pages in 2005. This size is
now even larger and keeps growing as time passes. Web search engines like Google, Yahoo,
and MSN play an important role in facilitating information access. A web search engine
consists of three main parts: a crawler that retrieves web pages, an indexer that builds
indexes, and a searcher. A major question a crawler has to face is which pages it should
retrieve so as to have the "most suitable" pages in the collection [3]. Crawlers normally
retrieve a limited number of pages. In this regard, the question is how fast a crawler can
collect the "most suitable" pages. A unique solution to this question is not likely to exist.
In what follows, we try to answer this question.
Different algorithms with different metrics have been suggested to lead a crawl towards
high quality pages [4, 5]. In [4], Cho, Garcia-Molina, and Page suggested using
connectivity-based metrics to do so. To direct a crawl, they used different ordering
metrics: breadth-first, backlink count, PageRank, and random. They showed that performing
a crawl in breadth-first order works reasonably well if "most suitable" pages are defined
to be pages with high PageRanks.
Najork and Wiener extended the result of Cho et al. They examined the average quality over
time of pages downloaded during a web crawl of 328 million unique pages. They showed that
traversing the web graph in breadth-first search order is a good crawling strategy.
Based on Henzinger's work [3], a better understanding of the graph structure might lead to
a more efficient way to crawl the web. We use this idea in this paper to develop our
algorithm. First, we define the "most suitable" pages and then we show how a crawler can
retrieve the most suitable pages. We use three metrics to measure the quality of a page.
The first metric is the community of a page: a good crawl should contain pages from
different communities. The second metric is its PageRank [2]: pages with high PageRanks
are the most important pages on the web. The third metric is the number of pages visited
per iteration: a good crawler will visit more pages in fewer iterations.
In this paper, we present a new fast algorithm for extracting a seeds set from previously
crawled pages. Using the metrics above, we show that by starting from the seeds suggested
by our algorithm, a crawler will quickly collect the most suitable pages from different
communities.
We have studied different community extraction algorithms: PageRank, Trawling, HITS, and
network-flow-based community discovery. From our analysis, we decided to use HITS ranking
without keyword search in our algorithm for community discovery and for collecting the
seeds set. We have found that bipartite cores are useful for selecting a seeds set.
Bipartite cores contain hub and authority pages. Since we are interested in having
authority pages in our crawl, we need to start crawling from hub pages. Hubs are durable
pages, so we can count on them for crawling.
The main idea in our method is to use HITS ranking on the whole graph for extracting the
most important bipartite cores. We offer two bipartite core extraction algorithms. Using
these algorithms, we extract bipartite cores and select some seeds from the hubs in the
extracted cores. Finally, we remove the extracted bipartite core from the graph and repeat
these steps until we have the desired number of seeds.
We have compared the results of crawls starting from the seeds set produced by our
algorithm with crawls starting from random nodes. Our experiments show that a crawl
starting from the seeds set identified by our algorithm finds the most suitable pages of
the web much faster than a random crawl does.
To the best of our knowledge, this is the first seeds extraction algorithm that is able to
identify and extract seeds from different web communities. A low running time is crucial
when working with large web data. The running time of the proposed algorithm is O(n). The
low running time together with the community-based properties makes this algorithm unique
in comparison with previous algorithms.
The remainder of this paper proceeds as follows. In Section 2, we present our algorithm
for discovering seed sets in a large web graph and compute the complexity of the proposed
algorithm. In Section 3, we discuss the results of running and evaluating this algorithm
on 18M and 39M node graphs. Section 4 contains conclusions and future work.
2. ALGORITHM for DISCOVERING
SEED SETS in LARGE WEB GRAPH
In this section, we present our algorithm for discovering seeds sets from a web graph.
First, we start with a discussion of the web structure; this will give the reader some
intuition about our algorithm. The web's macroscopic structure breaks into four pieces [6]:
SCC, IN, OUT, and TENDRILS. The SCC consists of pages all of which can reach one another
along directed links. IN consists of pages that can reach the SCC, but cannot be reached
from it. OUT consists of pages that are accessible from the SCC, but do not link back to
it. Finally, the TENDRILS contain pages that cannot reach the SCC and cannot be reached
from the SCC.
From the structure and analysis presented in [6], the most important web pages are
expected to be in SCC+OUT. From [2] we know that web pages with high PageRanks are the
most valuable pages on the web. In addition, from [3, 7, 8] we understand that bipartite
cores, usually called hubs and authorities, are among the most valuable sources on the
web. Besides, we know that the web contains thousands of different communities. In [7],
Kleinberg used keywords and a ranking method for finding hub and authority pages, whose
result contains pages in one or more communities. In [9], Flake et al. suggested a method
based on a network flow algorithm for finding web communities. The relation between the
two is that the community found by Kleinberg's method is expected to be a subset of the
community found by Flake et al.'s algorithm [10]. See Figure 1 for a sample relationship
between the results of these two algorithms.
A crawler normally does not crawl the entire web. Instead, it retrieves a limited number
of pages. Crawlers are expected to collect the "most suitable" pages of the web rapidly.
We defined the "most suitable" pages of the web as those pages with high PageRank. These
are the pages that the HITS algorithm calls authority pages. The difference is that the
HITS algorithm finds the authority pages related to given keywords, whereas PageRank shows
the importance of a page in the whole web. We also know that good hubs link to good
authorities. If we are able to extract good hubs from a web graph and from different
communities, we will be able to download good authorities with high PageRank from
different communities.
We use HITS ranking without keyword search on previously crawled web pages and then prune
the resulting subgraph. The procedure of HITS ranking followed by elimination is repeated
until enough seed points are obtained. We show that hub nodes selected from the subgraphs
resulting from these iterations can solve the crawl problem mentioned in this paper.
Figure 1. Relation between a community extracted by network flow and a HITS bipartite core.
2.1. Iterative HITS Ranking & Pruning
We assume that we have a web graph of crawled web pages. The goal is to extract a seeds
set from this graph so that a crawler can collect the most important pages of the web in
fewer iterations. To do this, we run the HITS ranking algorithm on this graph. This is the
second step of the HITS algorithm; in the first step, HITS searches for keywords in an
index-based search engine. For our purpose, we ignore this step and only run the ranking
step on the whole graph. In this way, bipartite cores with high hub and authority ranks
become visible in the graph. Then we select the most highly ranked bipartite core using
one of two algorithms we suggest, namely extracting seeds with fixed size and extracting
seeds with fixed density. Then, we remove this sub-graph from the graph, and repeat the
ranking, seed extraction, and sub-graph removal steps until we have enough seeds.
A question that may arise is why, when repeating these steps, we need to run HITS ranking
again. Is one ranking not enough for all the steps? The answer is that removing a bipartite
core in each step modifies the web graph structure we are working on. In fact, re-ranking
changes the hub and authority ranks of the bipartite cores. Removing the highest-ranked
bipartite core and re-ranking the web graph drives the newly appearing bipartite cores to
come from different communities. Thus, a crawler will be able to download pages from
different communities starting from these seeds. We have experimented with our algorithm
using the web graph of UK-2002, containing 18M nodes and 298M edges, and UK-2005,
containing 39M nodes and 936M edges [11]. Our experiments show that the extracted
bipartite cores have a reasonable distance from each other.
Another question that may arise is why, if a crawler starts from the seeds produced by our
algorithm, the crawl would lead to the most suitable pages. The answer is that, in the
iterations of the algorithm, we select and extract high-ranked bipartite cores from the
web graph. Extracted bipartite cores have high hub or authority ranks, and pages with high
hub rank are expected to link to pages with high PageRank. Our experiments support this
hypothesis.
2.2. HITS Ranking
Ranking is the second step of the HITS algorithm [7]. In our algorithm, the HITS-Ranking
procedure is given a directed graph G and HITSIterationCount as input. The procedure
defines two vectors, h and a, for the hub and authority ranks of the nodes; their initial
values are set to 1. The algorithm then updates the h and a values of all nodes in the
input graph. Afterwards, the updated vectors are normalized to 1. The algorithm repeats
the vector update and normalization steps until it reaches the required number of
iterations. The two vectors h and a are returned as the result of the HITS-Ranking
procedure. Figure 2 shows the HITS-Ranking algorithm.
2.3. Extracting Seeds with Fixed Size
The Extract-Bipartite-Cores-with-Fixed-Size procedure, as its name indicates, extracts one
bipartite sub-graph with the highest hub and authority ranks and with a predetermined size
given as an input. The algorithm is given a directed graph G, BipartiteCoreSize,
NewMemberCount, and the h and a vectors. BipartiteCoreSize specifies the desired size of
the bipartite core we would like to extract. NewMemberCount indicates how many hub or
authority nodes should be added to the hub or authority sets in each iteration of the
algorithm. The h and a vectors are the hub and authority ranks of the nodes in the input
graph G. In the initial steps, the algorithm sets HubSet to the empty set and adds the
node with the highest authority rank to AuthoritySet. While the sum of the AuthoritySet
size and the HubSet size is less than BipartiteCoreSize, it continues to find new hubs and
authorities, NewMemberCount at a time, and adds them to the related sets.
Procedure HITS-Ranking
Input: graph: G=(V,E) , integer:HITSIterationCount
1) For all v in V do
2) Set a(v) = 1;
3) Set h(v) = 1;
4) End For
5) For i=1 to HITSIterationCount do
6) For all v in V do
7) a(v) = Σ_{(w,v) ∈ E} h(w);
8) End For
9) For all v in V do
10) h(v) = Σ_{(v,w) ∈ E} a(w);
11) End For
12) Normalize h vector to 1;
13) Normalize a vector to 1;
14) End For
output: h, a
End Procedure
Figure 2. HITS Ranking Algorithm
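As a concrete illustration (not the authors' implementation), the HITS-Ranking procedure of Figure 2 can be sketched in Python as follows. The sketch assumes the graph is stored as a dictionary succ that maps every node, including pure link targets, to the list of nodes it links to, and it normalizes with the L2 norm; the paper does not specify which norm is used.

    import math

    def hits_ranking(succ, iteration_count=60):
        # succ: node -> list of successor nodes; every node must appear as a key.
        nodes = list(succ.keys())
        a = {v: 1.0 for v in nodes}   # authority ranks, initialized to 1
        h = {v: 1.0 for v in nodes}   # hub ranks, initialized to 1
        for _ in range(iteration_count):
            # authority update: a(v) = sum of h(w) over all edges (w, v)
            new_a = {v: 0.0 for v in nodes}
            for w in nodes:
                for v in succ[w]:
                    new_a[v] += h[w]
            a = new_a
            # hub update: h(v) = sum of a(w) over all edges (v, w)
            h = {v: sum(a[w] for w in succ[v]) for v in nodes}
            # normalize both vectors (L2 norm assumed)
            norm_a = math.sqrt(sum(x * x for x in a.values())) or 1.0
            norm_h = math.sqrt(sum(x * x for x in h.values())) or 1.0
            a = {v: x / norm_a for v, x in a.items()}
            h = {v: x / norm_h for v, x in h.items()}
        return h, a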
Figure 3. Extracting Bipartite Cores with Fixed Size
Procedure Extract-Bipartite-Cores-with-Fixed-Size
Input: graph: G=(V,E) , integer: BipartiteCoreSize,
NewMemberCount;
vector: h,a.
1) HubSet = ∅;
2) AuthoritySet = { v with the highest a(v) };
3) While |AuthoritySet| + |HubSet| < BipartiteCoreSize do
4) HubSet = HubSet ∪ (Find Top NewMemberCount nodes v by h(v) where (v,w) ∈ E
and w ∈ AuthoritySet and v ∉ HubSet);
5) AuthoritySet = AuthoritySet ∪ (Find Top NewMemberCount nodes w by a(w) where (v,w) ∈ E
and v ∈ HubSet and w ∉ AuthoritySet);
6) End While
output: HubSet, AuthoritySet
End Procedure
We use this procedure when we want to extract a bipartite sub-graph of fixed size.
Figure 3 shows the details of the Extract-Bipartite-Cores-with-Fixed-Size procedure. In
Figure 4 we show the steps of bipartite sub-graph creation with NewMemberCount equal to 1.
An interesting observation from our experiments is that, in the very first steps, all the
hubs have links to all the authorities. This led us to suggest an extraction algorithm
with a density factor, described in the following subsection.
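A minimal Python sketch of the fixed-size extraction, under the same assumptions as above plus a reverse adjacency dictionary pred (node to the list of nodes linking to it), might look as follows; it is an illustration of the procedure in Figure 3, not the authors' code.

    def extract_core_fixed_size(succ, pred, h, a, core_size, new_member_count):
        hub_set = set()
        authority_set = {max(a, key=a.get)}          # start from the top authority
        while len(hub_set) + len(authority_set) < core_size:
            # candidate hubs: nodes linking into the current authority set
            hub_cands = {v for w in authority_set for v in pred[w]} - hub_set
            hub_set |= set(sorted(hub_cands, key=lambda v: h[v],
                                  reverse=True)[:new_member_count])
            # candidate authorities: nodes linked to by the current hub set
            auth_cands = {w for v in hub_set for w in succ[v]} - authority_set
            authority_set |= set(sorted(auth_cands, key=lambda w: a[w],
                                        reverse=True)[:new_member_count])
            if not hub_cands and not auth_cands:
                break                                # the core cannot grow further
        return hub_set, authority_set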
2.4. Extracting Seeds with Fixed Cover
Density
The Extract-Bipartite-Cores-with-Fixed-CoverDensity procedure, as its name indicates,
extracts one bipartite sub-graph with the highest hub and authority ranks such that the
sub-graph satisfies the desired cover-density value. A directed graph G, CoverDensity, and
the h and a vectors are given to the algorithm.
We define Cover-Density as follows:
Cover-Density = 100 × |E(HubSet, AuthoritySet)| / (|HubSet| × |AuthoritySet|)     (1)
where E(HubSet, AuthoritySet) is the set of edges from nodes in HubSet to nodes in AuthoritySet.
This measure shows how densely the nodes in the authority set are covered by the nodes in
the hub set. If the bipartite sub-graph is a complete bipartite sub-graph, this measure
will be equal to 100. Therefore, if we intend to extract complete bipartite sub-graphs, we
set CoverDensity to 100. The h and a vectors are the hub and authority ranks of the nodes
in the input graph G.
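Under the same adjacency-dictionary assumption, the Cover-Density of a candidate pair of hub and authority sets can be computed with the following small sketch; for a complete bipartite core it returns 100.

    def cover_density(succ, hub_set, authority_set):
        # Equation (1): percentage of possible hub-to-authority links that exist.
        # Both sets are assumed to be non-empty.
        links = sum(1 for v in hub_set for w in succ[v] if w in authority_set)
        return 100.0 * links / (len(hub_set) * len(authority_set))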
In the initial steps, the algorithm sets HubSet to the empty set and adds the node with
the highest authority rank to AuthoritySet. In addition, it sets CoverDensityCur to 100.
While CoverDensityCur is greater than or equal to the input CoverDensity, the procedure
continues to find new hubs and authorities. This algorithm adds only one new node to each
set in every iteration. Recall that in Extract-Bipartite-Cores-with-Fixed-Size we could
adjust the number of new members; here we do not have such a variable, because we want a
precise cover density. In other words, increasing the number of new nodes beyond 1 might
reduce the accuracy of the desired cover density. We use this procedure when we want to
extract a bipartite sub-graph with a desired link density between hubs and authorities.
Figure 5 shows the details of the Extract-Bipartite-Cores-with-Fixed-CoverDensity
procedure.
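A sketch of this procedure, reusing the cover_density helper above and the succ/pred dictionaries assumed earlier (again an illustration, not the paper's implementation), is given below; one hub and at most one authority are added per iteration, mirroring Figure 5.

    def extract_core_fixed_density(succ, pred, h, a, cover_density_min=100.0):
        hub_set = set()
        authority_set = {max(a, key=a.get)}          # start from the top authority
        current = 100.0
        while current >= cover_density_min:
            # one new hub: highest-ranked node linking into the authority set
            hub_cands = {v for w in authority_set for v in pred[w]} - hub_set
            if not hub_cands:
                break
            hub_set.add(max(hub_cands, key=lambda v: h[v]))
            # one new authority: highest-ranked node linked to by the hub set
            auth_cands = {w for v in hub_set for w in succ[v]} - authority_set
            if auth_cands:
                authority_set.add(max(auth_cands, key=lambda w: a[w]))
            current = cover_density(succ, hub_set, authority_set)
        return hub_set, authority_set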
2.5. Putting It All Together
So far, we have presented algorithms for HITS ranking and for bipartite core extraction
based on hub and authority ranks. Our goal is to extract a desired number of seeds so that
a crawl can download pages with high PageRank from different web communities in fewer
iterations. We use the proposed algorithms to achieve this goal. We assume that we have a
web graph of crawled web pages. We run the HITS-Ranking algorithm on the whole graph and
use one of the bipartite core extraction algorithms we have presented. Then we select one
of the nodes in the extracted hub set arbitrarily and add it to our seeds set. Finally, we
remove the extracted core from the input graph and repeat these steps until we have the
desired number of seeds.
We can use either of the two bipartite-core extraction algorithms we have proposed:
Extract-Bipartite-Cores-with-Fixed-Size and Extract-Bipartite-Cores-with-Fixed-CoverDensity.
If we want bipartite cores of a fixed size, we use the first algorithm; if we are looking
for bipartite cores with a desired cover density, we use the second algorithm. For example,
if we want the bipartite cores to be complete, we must use the second algorithm. We have
experimented with both algorithms. Since we cannot guess a suitable size for a web
community, we use the second method, which controls the density of links between hubs and
authorities. If we have a complete bipartite core, we are sure that all the authority
pages are from the same community; by decreasing the Cover-Density measure, we decrease
the degree of relationship between the authority pages. Because the second method is more
reliable than the first one, in this paper we only present experimental results obtained
using Extract-Bipartite-Cores-with-Fixed-CoverDensity.
Figure 4. Steps of bipartite sub-graph creation with NewMemberCount equal to 1. (a) The
sub-graph after adding the node with the highest authority rank and the hub with the
highest rank that refers to this authority node. (b) The next authority with the highest
rank that is not yet in the authority set and is linked to by the only node in the hub
set. (c) The second hub node with the highest hub rank that was not already in the hub set
and links to one of the nodes in the authority set. (d) The resulting sub-graph after 4
steps.
Figure 6 shows the seed extraction algorithm we have used in our experiments in this paper.
The Extract-Seeds algorithm receives a directed graph G and SeedCount as input. In the
initial step, the algorithm sets SeedSet to the empty set. While the size of SeedSet is
less than SeedCount, the algorithm keeps running. In the first line of the while loop, the
algorithm calls the HITS-Ranking procedure with G as the input graph and 60 as
HITSIterationCount. Kleinberg's work shows that a HITSIterationCount of 20 is enough for
convergence of hub and authority ranks in a small sub-graph [7]. We have found
experimentally that a value above 50 is enough for convergence of hub and authority ranks
on the data sets we use. The HITS-Ranking algorithm returns two vectors, h and a,
containing the hub and authority ranks of all nodes in graph G. In the next line, the
algorithm calls Extract-Bipartite-Cores-with-Fixed-CoverDensity with G as the input graph,
100 as the cover density value, and h and a as the hub and authority vectors. This
function finds a complete bipartite core in the input graph and returns its nodes in
HubSet and AuthoritySet. In the next line, a node is selected at random from the hub set
and added to SeedSet. The algorithm then removes the hub and authority nodes and their
edges from the graph G. The removal step helps us find seeds from different communities.
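Putting the sketches together, the Extract-Seeds loop of Figure 6 could be approximated as follows; the node identifiers are assumed to be sortable (e.g. integers), and the removal is done by filtering the adjacency dictionaries. This is a simplified illustration of the paper's procedure, not its actual implementation.

    import random

    def extract_seeds(succ, pred, seed_count, hits_iterations=60):
        seeds = set()
        while len(seeds) < seed_count and succ:
            h, a = hits_ranking(succ, hits_iterations)
            hub_set, auth_set = extract_core_fixed_density(succ, pred, h, a, 100.0)
            if not hub_set:
                break                                  # no core could be grown
            seeds.add(random.choice(sorted(hub_set)))  # one arbitrary hub as a seed
            # remove the extracted core and all of its edges from the graph
            removed = hub_set | auth_set
            for v in removed:
                succ.pop(v, None)
                pred.pop(v, None)
            for v in succ:
                succ[v] = [w for w in succ[v] if w not in removed]
            for v in pred:
                pred[v] = [w for w in pred[v] if w not in removed]
        return seeds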
2.6. Complexity of Proposed Seeds
Extraction Algorithm
The running time of the seeds extraction algorithm (Figure 6) is O(n), where n is the
number of nodes in the input graph. The while loop of lines 2-12 is executed at most
|SeedCount| times. The HITS-Ranking call in line 3 is done in O(n), because the complexity
of HITS-Ranking (Figure 2) is Θ(2·K·L·n), where K is HITSIterationCount and L is the
average number of neighbors of a node. The factor 2 appears because there are two update
steps, one for the hub vector and one for the authority vector. In addition, the
normalization steps can be done in Θ(3n). So the complexity of HITS-Ranking is O(n).
The running time of Extract-Bipartite-Cores-with-Fixed-CoverDensity in line 4 is O(n). The
while loop of lines 4-8 in Figure 5 is executed at most |HubSet| + |AuthoritySet| times,
which can be viewed as a constant number k. Finding and adding a distinct hub node with
the highest rank to the hub set, in line 5, takes Θ(k·n). Finding and adding a distinct
authority node with the highest rank to the authority set, in line 6, takes Θ(k·n). So the
running time of Extract-Bipartite-Cores-with-Fixed-CoverDensity is at most O(n).
The removal steps of lines 6-11 in Figure 6 take O(n) for removing the identified hubs and
authorities. Therefore, the total running time of the seeds extraction algorithm is
O(|SeedCount|·n), which is equal to O(n) since SeedCount is a constant.
3. EXPERIMENTAL RESULTS
In this section, we apply the proposed algorithm to find a seeds set from previously
crawled pages. We then start a crawl from the extracted seeds on the same graph to
evaluate the result. To show how applying the algorithm to old data can provide good seeds
for a new crawl, we also start a crawl on a newer graph using a seeds set extracted from a
previous crawl.
3.1. Data Sets
The Laboratory for Web Algorithmics at the University of Milan provides several web graph
data sets [11]. In our experiments, we have used the UK-2002 and UK-2005 web graph data
sets provided by this laboratory. These data sets are compressed using the WebGraph
library. WebGraph is a framework for studying the web graph [12]. It provides simple ways
to manage very large graphs, exploiting modern compression techniques. With WebGraph, we
can access and analyze a very large web graph on a PC.
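WebGraph itself is a Java library; the Python sketches in this paper's description instead assume a plain adjacency-dictionary representation. One hypothetical way to obtain it is to export the compressed graph to a whitespace-separated "source target" edge list first and load it as below; the file format and helper are assumptions, not part of the WebGraph API.

    def load_edge_list(path):
        # Builds the succ/pred dictionaries used by the sketches above from an
        # edge-list file with one "source target" pair of integer ids per line.
        succ, pred = {}, {}
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                src, dst = map(int, line.split())
                succ.setdefault(src, []).append(dst)
                pred.setdefault(dst, []).append(src)
                succ.setdefault(dst, [])   # ensure every node appears as a key
                pred.setdefault(src, [])
        return succ, pred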
Figure 6. Seeds Extraction Algorithm
Procedure Extract-Seeds
Input: graph: G=(V,E) , integer: SeedCount;
1) SeedSet = ∅;
2) While |SeedSet| < SeedCount do
3) h, a = HITS-Ranking( G , 60);
4) HubSet, AuthoritySet = Extract-Bipartite-Cores-with-Fixed-CoverDensity(G, 100, h, a);
5) SeedSet = SeedSet ∪ {a node selected arbitrarily from HubSet};
6) For all v in HubSet do
7) Remove v and all E(v) from G;
8) End For
9) For all v in AuthoritySet do
10) Remove v and all E(v) from G;
11) End For
12) End While
output: SeedSet
End Procedure
Figure 5. Extracting Bipartite Cores with Fixed Density
Procedure Extract-Bipartite-Cores-with-Fixed-CoverDensity
Input: graph: G=(V,E) , integer: CoverDensity;
vector: h,a.
1) HubSet = ∅;
2) AuthoritySet = { v with the highest a(v) };
3) CoverDensityCur = 100;
4) While CoverDensityCur ≥ CoverDensity do
5) HubSet = HubSet ∪ {the node v with the highest h(v) where (v,w) ∈ E
and w ∈ AuthoritySet and v ∉ HubSet};
6) AuthoritySet = AuthoritySet ∪ {the node w with the highest a(w) where (v,w) ∈ E
and v ∈ HubSet and w ∉ AuthoritySet};
7) CoverDensityCur = 100 × |E(HubSet, AuthoritySet)| / (|HubSet| × |AuthoritySet|);
8) End While
output: HubSet, AuthoritySet
End Procedure
Table 1. UK-2002 and UK-2005 data set information before pruning
Data Set    Nodes        Edges          Diameter Estimate
UK-2002     18,520,486   298,113,762    14.9
UK-2005     39,459,935   936,364,282    15.7

Table 2. UK-2002 and UK-2005 data set information after pruning
Data Set    Nodes        Edges
UK-2002     18,520,486   22,720,534
UK-2005     39,459,935   183,874,700
Figure 7. Log-log in-degree distribution of UK-2002.
Figure 8. Log-log out-degree distribution of UK-2002.
Figure 9. Log-log in-degree distribution of UK-2005.
Figure 10. Log-log out-degree distribution of UK-2005.
Figure 11. Log-log diagram of hub and authority set sizes extracted from UK-2002 in different iterations.
Figure 12. Log-log diagram of hub and authority set sizes extracted from UK-2005 in different iterations.
3.1.1. UK-2002
This data set was obtained from a 2002 crawl of the .uk domain performed by UbiCrawler
[13]. The graph contains 18,520,486 nodes and 298,113,762 links.
3.1.2. UK-2005
This data set was obtained from a 2005 crawl of the .uk domain performed by UbiCrawler.
The crawl was very shallow, and aimed at gathering a large number of hosts but a small
number of pages from each host. This graph contains 39,459,935 nodes and 936,364,282 links.
3.2. Data Set Characteristics
3.2.1. Degree Distribution
We investigated the degree distributions of UK-2002 and UK-2005. Figures 7 and 8 show the
in-degree and out-degree distributions for UK-2002 in log-log form. Figures 9 and 10 show
the in-degree and out-degree distributions for UK-2005 in log-log form. The results show
that the in-degree and out-degree distributions follow power laws in these two data sets.
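The degree-distribution check can be reproduced with a short sketch like the one below, which counts how many nodes have each in-degree and returns the points of the log-log plot; a power law shows up as a roughly straight line. This is illustrative only and not the analysis code used for the figures.

    import math
    from collections import Counter

    def in_degree_distribution(pred):
        # pred: node -> list of nodes linking to it.
        counts = Counter(len(in_links) for in_links in pred.values())
        # (log in-degree, log node count) pairs; +1 avoids log(0) for degree 0.
        return sorted((math.log(d + 1), math.log(c)) for d, c in counts.items())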
3.2.2. Diameter
The diameter of a web graph is defined as the length of the shortest path from u to v,
averaged over all ordered pairs (u, v) [14]. Of course, we omit pairs for which there is
no path between them (infinite distance). This is called the average connected distance in
[6]. We estimated this measure on the UK-2002 and UK-2005 data sets through experiments.
Table 1 shows the estimated diameter of these data sets together with the number of nodes
and edges. We use the resulting diameter to evaluate the distances between the bipartite
cores extracted by our method.
3.3. Data Preparation
3.3.1. Pruning
Most of the links between pages within a site are for navigational purposes. These links
may distort the result of the presented algorithm: running HITS-Ranking on the un-pruned
graph tends to find hub and authority pages inside a single site. To eliminate this effect
we remove all links between pages in the same site, assuming that pages with the same host
name are in the same site. Table 2 shows the number of nodes and edges in the UK data sets
after pruning.
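The pruning step can be sketched as follows; the hostname helper that maps a node id to its host name is an assumption about how the URL table accompanying the graph is exposed.

    def prune_intra_site_links(succ, hostname):
        # Drop every link whose source and target share the same host name.
        pruned = {}
        for v, out_links in succ.items():
            pruned[v] = [w for w in out_links if hostname(w) != hostname(v)]
        return pruned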
Figure 13. Graphical representation of distances between 56 extracted seeds from UK-2002 by our algorithm. The number beside each node indicates the iteration in which the related node was extracted.
Figure 14. Comparison of the PageRank of crawled pages starting from 10 seeds extracted by our method on UK-2002 and from 10 random seeds.
Figure 15. Comparison of the log-count of pages visited at each iteration starting from 10 seeds extracted by our method and from 10 seeds selected randomly from UK-2002.
3.4. Results of Extracted Seeds using
Proposed Algorithm
We ran our seeds extraction algorithm, Extract-Seeds, on UK-2002 and UK-2005. This
algorithm, as Figure 6 shows, sets CoverDensity to 100 for seed extraction. It searches
for and extracts complete bipartite cores and then, at each step, selects a seed from the
hub nodes of the bipartite sub-graph (see Figure 4). Figure 11 shows the sizes of the
extracted hub and authority sets in different iterations for UK-2002; these cores are
complete bipartite. To reduce the impact of outlier hub sizes in the graphical
presentation, we have used a log-log diagram. Figure 12 depicts the sizes of the extracted
hub and authority sets in different iterations for UK-2005. Normally the hub sets are
larger than the authority sets. We obtained bipartite cores with very large hub sets in
UK-2002, so we have limited the number of hubs to 999 in the UK-2002 data set.
3.5. Quality Analysis
3.5.1. Metrics for Analysis
We used several metrics to evaluate the quality of the extracted seeds. The first metric
is the distance between the extracted seeds. As mentioned earlier, a crawler should
retrieve web pages from different communities. Using HITS ranking and iterative pruning,
we expect the extracted seeds to come from different communities. To confirm this
intuition, we measure the distances between the extracted cores. We define the core
distance as the length of the shortest directed path from one of the nodes in the source
core to one of the nodes in the destination core.
The second metric is the PageRank of the pages that will be crawled starting from these
seeds. We have defined the most suitable pages on the web to be the pages with high
PageRanks. Therefore, if the average PageRank of the crawled pages at each step of the
crawl is higher than that of a random crawl, especially at the beginning, then we can
conclude that a crawl starting from the seeds identified by our algorithm results in
better pages.
The third metric is the number of crawled pages at each step of crawling. We focus on a
crawler whose goal is to download good pages in few iterations. Thus, if the number of
pages crawled at each step starting from seeds extracted by our method is larger than the
number of pages crawled starting from a random set, then we can conclude that our method
also leads a crawl toward visiting more pages in fewer iterations.
For the first metric, we measure the distance between the cores. For the other two
metrics, we need to crawl the graph starting from seeds extracted with our method and
compare it with a crawl starting from randomly selected seeds.
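The core-distance metric can be computed with a multi-source BFS from the source core over the directed links, as in the sketch below (an illustration under the adjacency-dictionary assumption used earlier).

    from collections import deque

    def core_distance(succ, source_core, dest_core):
        # Shortest directed path length from any node of source_core
        # to any node of dest_core.
        dest = set(dest_core)
        dist = {v: 0 for v in source_core}
        queue = deque(source_core)
        while queue:
            v = queue.popleft()
            if v in dest:
                return dist[v]
            for w in succ.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return float('inf')   # no directed path between the cores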
3.5.2. Result of Bipartite Core Distances
We have measured the distances between all bipartite cores extracted from the UK data
sets, and they have a reasonable distance in comparison with the diameter of the related
graph. Figure 13 shows the graphical representation of the distances between 54 cores
extracted from UK-2002. The number at the top of each node indicates the iteration in
which the core was extracted. Because the distance graph between nodes may not have a
Euclidean representation, the distances in this figure do not exactly match the real
distances. Another important observation is that bipartite cores from nearby iterations
have a distance equal to or bigger than the average distance, while cores that are close
to each other (connected by a short directed path) are identified in far-apart iterations.
As an example, the core extracted in iteration 32 has a distance of one to the core
extracted in iteration 47. In this sample, the minimum distance between nodes is 1, the
maximum distance is 13, and the average distance is 7.15. As the estimated diameter of the
UK-2002 data set is 14.9, these core distances are reasonable.
3.5.3. Result of Average PageRank and Visit
Count
In this section, we evaluate the second and third metrics defined above. For UK-2002, we
executed the Extract-Seeds algorithm with SeedCount = 10; therefore, the algorithm
extracts one seed from each core in each iteration.
Figure 16. Comparison of the PageRank of crawled pages starting from 10 seeds extracted from UK-2005 by our method and from 10 random seeds selected from UK-2005.
Figure 17. Comparison of the log-count of pages visited at each iteration starting from 10 seeds extracted by our method from UK-2005 and from 10 seeds selected randomly on UK-2005.
Then, we started a crawl on the UK-2002 data set using a BFS strategy and measured the
average PageRank of the pages visited at each crawl depth, as well as the number of pages
visited at each depth. We then compared the results with those of a crawl starting from
random seeds on the same graph.
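The evaluation crawl can be simulated with a BFS over the graph that records, at each depth, the number of newly visited pages and their mean PageRank, as sketched below; the precomputed pagerank dictionary is an assumption, and this is not the evaluation code used for the reported figures.

    def bfs_evaluate(succ, seeds, pagerank, max_depth=20):
        visited = set(seeds)
        frontier = list(seeds)
        stats = []                        # (depth, pages at depth, mean PageRank)
        for depth in range(max_depth):
            if not frontier:
                break
            mean_pr = sum(pagerank[v] for v in frontier) / len(frontier)
            stats.append((depth, len(frontier), mean_pr))
            next_frontier = []
            for v in frontier:
                for w in succ.get(v, []):
                    if w not in visited:
                        visited.add(w)
                        next_frontier.append(w)
            frontier = next_frontier
        return stats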
Figure 14 compares the average PageRank of the crawl starting from seeds extracted with
our method with that of a crawl starting from random seeds. Except for the first depth
(iteration) of the crawl, in the other steps up to step 4 the average PageRank of pages
crawled with our method is better; especially in the second and third iterations, the
difference is substantial. In the later iterations the average PageRank of the visited
pages is close for both methods.
Figure 15 compares the log-number of pages visited at each depth of the crawl on UK-2002.
For better graphical representation, we have plotted the log-count of visited pages. The
results of our method are always better than those of the crawl starting from random
seeds, and a crawl with seeds extracted by our method downloads more pages in fewer
iterations. Figures 16 and 17 show the same experiments on UK-2005; the same results
appear there too.
3.5.4. Good Seeds for a New Crawl
Using the proposed algorithm we have discovered seeds from UK-2002 and UK-2005 and
evaluated their quality using the three evaluation criteria. These evaluations are
encouraging, but a real crawler does not have access to the web graph it is going to
crawl. We should therefore show that the result is still good if we start the crawl using
seeds extracted from an older crawled graph.
In this section, we show the result of crawling UK-2005 using seeds extracted from UK-2002
by the proposed algorithm, and we compare it with randomly selected seeds to simulate a
real environment. Before running the crawl, we checked the validity of the seeds found in
UK-2002 against the UK-2005 data set: if a seed does not exist in the newer graph, we
remove it from the seeds set. Our experiments show that only 11 percent of the seeds exist
in the new data set. In fact, we extracted 100 seeds from UK-2002 to be sure of having 11
valid seeds in UK-2005.
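The validity check can be expressed as a one-line filter once the seeds are identified by URL; the url_of_old mapping and the set of URLs in the newer graph are assumptions about how the two data sets are joined.

    def valid_seeds(old_seeds, url_of_old, urls_in_new_graph):
        # Keep only the old seeds whose URL still exists in the newer graph.
        return {s for s in old_seeds if url_of_old[s] in urls_in_new_graph}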
Figure 18 compares the average PageRank of the crawl starting from seeds extracted with
our method with that of a crawl starting from random seeds. The result of our method is
better until iteration 3. Figure 19 compares the log-number of pages visited at each depth
of the crawl; in this case, the result of our method is better than the random case
between steps 4 and 15. In other words, our method downloads pages with high PageRank
until iteration 3, and after that it crawls more pages than the random case until
iteration 15; after that the results are nearly the same. Therefore, we can conclude that
a crawler starting from a seeds set generated by our algorithm can download qualified web
pages in fewer iterations.
4. CONCLUSION and FUTURE WORK
Crawlers aim to download more good pages in few iterations. In this paper, we have
presented a new fast algorithm with running time O(n) for extracting a seeds set from
previously crawled web pages. Our experiments show that if a crawler starts crawling from
the seeds set identified by our method, it will crawl more pages with higher PageRank,
from different communities, in fewer iterations than when starting from a random seeds
set. In addition, we have measured the distance between the selected seeds to make sure
that our seeds set contains nodes from different communities. To the best of our
knowledge, this is the first seeds extraction algorithm that is able to identify and
extract seeds from different communities.
Our experiments were performed on graphs containing at most 39M nodes and 183M edges. This
method can be evaluated on larger graphs in order to investigate the quality of its
results there too.
Figure 18. Comparison of the PageRank of crawled pages starting from 11 seeds extracted from UK-2002 by our method and from 11 random seeds selected from UK-2005.
Figure 19. Comparison of the log-count of pages visited at each iteration starting from 11 seeds extracted by our method from UK-2002 and from 11 seeds selected randomly on UK-2005.
Another aspect where improvement may be possible is the handling of seeds that are not
found in a new crawl. In our experiments, we simply ignored nodes that were present in the
older graph but not in the newer one. This could be improved by finding similar nodes in
the newer graph.
5. REFERENCES
[1] Gulli, A., and Signorini, A. The Indexable Web is More
than 11.5 billion pages. WWW (Special interest and tracks
and posters), (May. 2005), 902-903.
[2] Brin, S. and Page, L. The anatomy of a large-scale
hypertextual Web search engine. Proceedings of the
seventh international conference on World Wide Web 7,
Brisbane, Australia, 1998, 107 – 117.
[3] Henzinger, M. R. Algorithmic challenges in Web Search
Engines. Internet Mathematics, Volume 1, Number 1,
2003, 115-123.
[4] Cho,J. Garcia-Molina, H. and Page, L. Efficient Crawling
through URL ordering. In Proceedings of the 7th
International World Wide Web Conference, pages 161-
172, Brisbane, Australia, April 1998. Elsevier Science
[5] Najork, M. and Wiener, J. L. Breadth-First Search Crawling
Yields High-Quality Pages, Proceedings of the 10th
international conference on World Wide Web WWW '01,
2001.
[6] Andrei Z. Broder, Ravi Kumar, Farzin Maghoul, Prabhakar
Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew
Tomkins, Janet L. Wiener: Graph structure in the Web.
Computer Networks 33(1-6): 309-320 (2000)
[7] Jon M. Kleinberg: Authoritative Sources in a Hyperlinked
Environment. J. ACM 46(5): 604-632 (1999)
[8] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan,
Andrew Tomkins: Trawling the Web for Emerging Cyber-
Communities. Computer Networks 31(11-16): 1481-1493
(1999)
[9] Gary William Flake, Steve Lawrence, C. Lee Giles, Frans
Coetzee: Self-Organization and Identification of Web
Communities. IEEE Computer 35(3): 66-71 (2002)
[10] J. Kleinberg, S. Lawrence. The Structure of the Web.
Science 294(2001), 1849.
[11] Laboratory for Web Algorithmics, http://law.dsi.unimi.it/
[12] Paolo Boldi and Sebastiano Vigna, The WebGraph
framework I: Compression techniques. In Proc. of the
Thirteenth International World Wide Web Conference
(WWW 2004), pages 595-601, Manhattan, USA, 2004.
ACM Press.
[13] Boldi, P., Codenotti, B., Santini, M., Vigna, S.,
UbiCrawler: A Scalable Fully Distributed Web Crawler,
Journal of Software: Practice & Experience, 2004, volume
34, number 8, pages 711—726.
[14] Albert, R., Jeong, H., and Barabasi, A. L. A Random Graph Model for Massive Graphs,
ACM Symposium on Theory of Computing, 2000.