Framework for Green Search Engine Design
Kapil Kumar Nagwanshi
RCET Bhilai
Praval Kumar Jha
HCL Technologies Bangalore
Sipi Dubey
RCET Bhilai
ABSTRACT
Traditional search engines use a thin-client, distributed model for crawling. This crawler-based approach has certain drawbacks which could be removed with the proposed rich-client based model. The rich-client based search engine offers faster crawling and shorter update times while using fewer resources than the thin-client model, and it covers more of the World Wide Web than conventional crawler-based search engines. Although modern search engine giants have improved various features such as ergonomics and utilities, along with several added extras, little work has been done to improve the energy efficiency of such large-scale search engines. As the Internet grows exponentially, search engines will involve more and more servers and thus consume more and more energy. This ever-increasing demand needs to be curbed. Rather than multiplying server resources, it is better to use existing servers working in a congenial environment, using communication methods to reduce redundant downloading of data from different servers by the crawlers. This paper proposes a rich-client based architecture for search engines, along with analysis and comparison with present search engines. This could help reduce the challenges of global warming while keeping up with speed and efficiency requirements.
General Terms
Search engine optimization.
Keywords
Search engines, thick client, rich client, update delay, crawler.
1. INTRODUCTION
Typically, search engines use a crawler to crawl through URLs, extract links, and index mined data from web pages to build a web repository [1]. Large search engines maintain a gigantic repository of web pages, which is indexed with the help of data mining tools for faster searches. The crawlers reside on high-performance servers spread throughout the world and crawl the web 24x7. When they find an un-indexed page, they store it in the repository and index it; when a web page is revisited during crawling and its content has changed, an update is made and the old page in the repository is replaced by the new one. Commercial search engines have periodic update policies for maintaining a relevant web repository [2].
The update frequency of the search engine database for a particular site depends on a number of parameters, such as (i) priorities decided by the search engine, (ii) page rank of the site, (iii) frequency of occurrence in the queue, and (iv) availability of the site on the WWW. The search engine may decide its priority for updating different sites based on diverse criteria such as (i) how popular the website is, (ii) the size of the website, (iii) how often the site content changes, and (iv) how many links point to that particular website (which partially depends on the page rank) [6]. The content change frequency and page rank are used to schedule updates. Content changes are observed, and an optimum refresh policy is obtained by ignoring pages that change too frequently [7].
Fig. 1. Change frequency vs. refresh frequency for freshness optimization.
Fig. 2. Study result on URL ordering with various scheduling methods [10].
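To make this scheduling idea concrete, the following is a minimal sketch (ours, not the original authors') of a refresh scheduler that splits a fixed crawl budget across sites by page rank and estimated change frequency, and ignores pages that change too often to be kept fresh; the function name, weighting, and cut-off threshold are illustrative assumptions.

```python
# Illustrative sketch: allocate a limited refresh budget across sites.
# The weighting by page rank and the cut-off for very volatile pages are
# assumptions made for illustration, not the paper's exact policy.

def schedule_refresh(sites, budget_crawls_per_day, volatility_cutoff=24.0):
    """sites: list of dicts with 'url', 'change_freq' (changes/day), 'page_rank'."""
    # Ignore pages that change too frequently to ever be kept fresh.
    candidates = [s for s in sites if s["change_freq"] <= volatility_cutoff]
    # Priority grows with both page rank and change frequency.
    total = sum(s["page_rank"] * s["change_freq"] for s in candidates) or 1.0
    schedule = {}
    for s in candidates:
        share = s["page_rank"] * s["change_freq"] / total
        schedule[s["url"]] = share * budget_crawls_per_day  # crawls/day for this site
    return schedule

if __name__ == "__main__":
    sites = [
        {"url": "http://a.example", "change_freq": 1.0, "page_rank": 0.6},
        {"url": "http://b.example", "change_freq": 0.1, "page_rank": 0.3},
        {"url": "http://c.example", "change_freq": 100.0, "page_rank": 0.1},  # too volatile, skipped
    ]
    print(schedule_refresh(sites, budget_crawls_per_day=1000))
```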
1.1 Traditional Search Engines
The drawbacks of traditional search engines include (i) update delays, (ii) wastage of storage space, (iii) incomplete data and less coverage, (iv) performance impact on websites, and (v) contribution to global warming.
1.1.1 Update delays
The update delay is a major drawback of traditional search engines. The update might be either steady or periodic [5]; in the case of steady crawling, a changed page must come up again in the crawl path before it is updated in the repository. Steady crawlers use page rank to prioritize crawling. Periodic crawling uses change-frequency estimates
to schedule updating. Even then, if a page is updated more than once between two crawls, only the latest version is copied to the repository [6], and if a page change misses an update interval it has to wait even longer.
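The cost of a missed interval can be quantified under the Poisson-change assumption used later in Section 2: if a page changes more than once between two crawls, only the last version survives. The sketch below (ours, with illustrative rates) computes the expected number of lost intermediate versions per crawl interval.

```python
import math

# Sketch (our illustration, not from the paper): with Poisson page changes at
# rate lam (changes per day) and a periodic crawl every delta days, any change
# beyond the first within an interval overwrites earlier ones, so those
# intermediate versions are never captured.

def expected_missed_versions(lam, delta):
    """E[(N - 1)^+] for N ~ Poisson(lam * delta): versions lost per crawl interval."""
    mean = lam * delta
    p_at_least_one = 1.0 - math.exp(-mean)
    return mean - p_at_least_one

for lam in (0.5, 2.0, 6.0):            # changes per day
    print(lam, round(expected_missed_versions(lam, delta=1.0), 3))
```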
1.1.2 Wastage of Storage Space
The pages crawled by the search engine are downloaded, saved into the web repository, and then indexed [2]. If the data already exists online, saving it again in repositories maintained by search engines unknowingly wastes storage resources. A large-scale search engine need not save every page into its repository; an abstract of each page would suffice.
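As a hedged illustration of storing only an abstract, the sketch below keeps a content hash (for change detection) and a short text excerpt instead of the full page; the field names and abstract length are assumptions, since the paper does not fix a format.

```python
import hashlib
import re

# Sketch of the "store only an abstract" idea: keep a content hash plus a
# short text summary, instead of the full page body.

def make_abstract(url, html_text, max_words=50):
    text = re.sub(r"<[^>]+>", " ", html_text)          # crude tag stripping
    words = text.split()
    return {
        "url": url,
        "digest": hashlib.sha256(html_text.encode("utf-8")).hexdigest(),
        "abstract": " ".join(words[:max_words]),
    }

record = make_abstract("http://example.com",
                       "<html><body>Green search engine design ...</body></html>")
print(record["digest"][:12], record["abstract"])
```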
1.1.3 Incomplete data & less coverage
Only a very small share of web pages, about 6-12%, is indexed [2]. This is due to the following reasons:
- Search engines have limited bandwidth.
- Crawlers may not reach important pages because they are too far from the pages being crawled.
- The depth of crawl is fixed, and the page is located too deep.
Fig. 3. Invisible web
Even the part of the World Wide Web that is not hidden behind dynamic content gets amalgamated into the deep web when the crawler cannot find a link to it from the set of pages it is crawling. This may be visualized from Fig. 3.
1.1.4 Performance impact on websites
When the crawler revisits a website to gather updates, the web server uses its network bandwidth to serve the crawler; it also has to spend CPU time and memory to entertain crawlers [5]. These resources could have been better used by the web server to serve its users, and they are wasted entirely if the site does not change frequently.
1.1.5 Contribution to global warming
When crawlers are indefinitely crawling and downloading web pages, a lot of energy is wasted in fetching pages from the World Wide Web, since it involves transmission of packets along the entire route from source to destination (i.e., our server) [7]. Pages are downloaded in full even when they carry the same date as the copy already in our repository [8], [9].
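One existing mechanism that avoids re-downloading unchanged pages is a conditional HTTP GET; the sketch below (an illustration, not part of the proposed architecture) asks the server to send the body only if the page has changed since the given date. The URL and date are placeholders.

```python
import urllib.error
import urllib.request

# Sketch: avoid re-downloading an unchanged page with a conditional HTTP GET.
# A 304 response means the body is not transmitted again.

req = urllib.request.Request(
    "http://example.com/page.html",
    headers={"If-Modified-Since": "Sat, 01 Jan 2015 00:00:00 GMT"},
)
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()                 # 200 OK: content changed, download it
        print("changed, fetched", len(body), "bytes")
except urllib.error.HTTPError as err:
    if err.code == 304:
        print("not modified, nothing downloaded")   # reuse the repository copy
    else:
        raise
```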
2. MATHEMATICAL MODELLING
Page rank is calculated from the number of pages that point to a page; it is essentially a measure based on the number of back-links. A back-link is a link pointing to a page rather than pointing out from it. The measure is not purely a count of back-links, because a weighting is used to give more importance to back-links coming from important pages. Given a page p, let Bp be the set of pages that point to p, and Fp the set of links out of p. The page rank [3] of page p is defined as eq. (1):
R(p) = c \sum_{q \in B_p} \frac{R(q)}{N_q}, \qquad N_q = |F_q|    (1)
Here c is a constant used for normalization, with a value between 0 and 1. A problem called rank sink exists with the above page-rank calculation: it causes R to grow for cyclic references. It is removed by adding an additional term to eq. (1), giving the improved page rank R' as eq. (2):
R'(p) = c \sum_{q \in B_p} \frac{R'(q)}{N_q} + c\,E(v)    (2)
Here c is maximized and E(v) is a vector that adds artificial links. This simulates a random surfer who periodically stops following links and jumps to a new page; E(v) adds links of small probability between every pair of nodes. This page-rank calculation takes place after a query is fired, and the sorted results are displayed with the highest-ranked pages first [4]. Updates are scheduled after observing how frequently a site's content changes: pages that change daily are updated daily, while weekly changing sites are updated once a week.
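For illustration, a minimal power-iteration sketch of the recurrence in eq. (2) is given below, using the common random-surfer form with a uniform jump term playing the role of c·E(v); the graph and the value of c are assumptions chosen only for the example.

```python
# Minimal power-iteration sketch of the page-rank recurrence in eq. (2);
# graph and damping value are illustrative assumptions.

def pagerank(links, c=0.85, iters=50):
    """links: dict mapping page -> list of pages it links to (F_p)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - c) / n for p in pages}       # uniform jump term (random surfer)
        for p, outs in links.items():
            if not outs:                              # dangling page: spread rank evenly
                for q in pages:
                    new[q] += c * rank[p] / n
            else:
                for q in outs:                        # back-link contribution R(p)/N_p
                    new[q] += c * rank[p] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print({p: round(r, 3) for p, r in pagerank(graph).items()})
```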
2.1 Reduced Time Complexity
The time required by a traditional crawler to crawl a particular website is given by eq. (3):
T = \frac{n_p}{k}\left(\frac{S_p}{B} + D_n\right)    (3)
where
n_p is the number of pages,
k is a constant defined by the concurrency technique used,
S_p is the average web page size,
B is the bandwidth of the channel, and
D_n is the network delay involved.
The time required by our proposed crawler to crawl the same website will be
T' = \frac{n_p}{k}\cdot\frac{S_p}{B_f} + D_n    (4)
Here B_f is the bandwidth of local browsing, with B_f \approx 10^2 B.
Since with our technique the XML sitemap file size increases with the number of pages, D_n should grow with the size of the file transferred. For simplicity we omit this effect, as it does not significantly change the comparison.
Comparing the two equations, the network delay is multiplied by the number of pages in the traditional case, so for a practical channel the time required by a traditional search engine will be much higher than that of our proposed crawler.
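A back-of-the-envelope evaluation of eqs. (3) and (4) makes the gap visible; all parameter values below are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope comparison of eq. (3) and eq. (4); all values assumed.

n_p = 10_000          # pages on the site
k   = 10              # concurrency factor
S_p = 50e3 * 8        # average page size: 50 kB in bits
B   = 10e6            # crawler channel bandwidth: 10 Mbit/s
B_f = 100 * B         # local-browsing bandwidth, B_f ~ 10^2 * B
D_n = 0.2             # network delay per fetch, seconds

T_traditional = (n_p / k) * (S_p / B + D_n)     # delay paid on every page, eq. (3)
T_proposed    = (n_p / k) * (S_p / B_f) + D_n   # delay paid once, eq. (4)

print(f"traditional: {T_traditional:.1f} s, proposed: {T_proposed:.1f} s")
```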
The crawling of a site is fruitful only if its content has changed; therefore equations (3) and (4) need to be modified to obtain the fruitful time, as eqs. (5) and (6):
T_f = \frac{n_p}{k}\left(\frac{S_p}{B} + D_n\right) P(\text{Change}), \qquad P(\text{Change}) = \frac{(\lambda t)^n e^{-\lambda t}}{n!}    (5)
where λ is the frequency of change and n is the number of arrivals (change events). Assuming that page content changes follow a Poisson process [6], the crawler is effective only when crawling is scheduled so that P(Change) is at its maximum, which occurs exactly at λ.
T'_f = \left(\frac{n_p}{k}\cdot\frac{S_p}{B_f} + D_n\right) P(\text{Change}), \qquad P(\text{Change}) = 1    (6)
P(Change) in the above equation becomes 1, as the crawl event is fired by a change of content. These equations can be plotted, and the area under each curve gives the successful effort applied by each crawler. The traditional crawler's performance varies with the probability of change.
Fig. 4. Time expenditure comparison: T and T' versus the number of pages n_p.
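The sketch below evaluates the fruitful-time expressions of eqs. (5) and (6) for a few crawl intervals, under the same illustrative parameters as before; it is only a numerical illustration of the comparison, not the paper's experiment.

```python
import math

# Sketch evaluating the fruitful-time expressions of eq. (5) and eq. (6);
# parameter values are illustrative assumptions.

def p_change(lam, t, n=1):
    """Poisson probability of exactly n change events in time t."""
    return (lam * t) ** n * math.exp(-lam * t) / math.factorial(n)

n_p, k, S_p, B, B_f, D_n = 10_000, 10, 50e3 * 8, 10e6, 1e9, 0.2
lam = 1.0                                   # one change per day, on average

for t in (0.25, 0.5, 1.0, 2.0, 4.0):        # crawl intervals in days
    T_f  = (n_p / k) * (S_p / B + D_n) * p_change(lam, t)        # eq. (5)
    T_f2 = ((n_p / k) * (S_p / B_f) + D_n) * 1.0                 # eq. (6), P(Change) = 1
    print(f"t={t:4.2f}  traditional fruitful time={T_f:7.1f} s  proposed={T_f2:5.1f} s")
```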
2.2 Space Complexity
Our sitemap reduces the text content of the website by abstraction, so the space complexity is reduced. The sitemap may be further space-optimized by applying text compression for large websites.
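As a sketch of the compression idea, standard zlib compression already shrinks a repetitive XML sitemap considerably; the sitemap fragment below is made up for illustration.

```python
import zlib

# Sketch of the text-compression idea for large sitemaps; the sitemap content
# below is a made-up fragment, repeated to mimic a large site.

sitemap = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset>
  <url><loc>http://example.com/</loc><lastmod>2015-01-01</lastmod></url>
  <url><loc>http://example.com/a</loc><lastmod>2015-02-01</lastmod></url>
</urlset>
""" * 100

compressed = zlib.compress(sitemap, 9)
print(len(sitemap), "->", len(compressed), "bytes")
```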
2.3 Lesser power consumption
Traditional steady crawlers crawl the web continuously, and the continuous running of multiple servers adds to the cost [6]. With our architecture, updates become event-fired and are analogous to a Poisson process [5]; the unnecessary wandering of crawlers, which wastes power, is thus removed [8], [9].
Fig. 5. Change in probability function with maximum at λ [6].
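One way such an event-fired update could look is sketched below: when content changes, the site-side agent pushes a small notification to the search engine instead of waiting to be crawled. The endpoint URL and payload fields are hypothetical, as the paper does not specify a wire format.

```python
import json
import urllib.request

# Hypothetical shape of an event-fired update (our illustration): on a content
# change, the local agent pushes a small notification. The endpoint URL and
# payload fields are made up for this sketch.

def notify_search_engine(page_url, content_digest,
                         endpoint="http://search.example/notify"):
    payload = json.dumps({"url": page_url, "digest": content_digest}).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

# Example call (would need a real endpoint to succeed):
# notify_search_engine("http://example.com/a", "3f2a...")
```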
2.4 Better Coverage
The deep-web problem is addressed by this architecture. As each crawl is limited to a local resource, broken links will be less frequent. For dynamic content, the site administrator may provide URLs to focus the crawler on the particular pages to be indexed in the search engine.
3. CONCLUSIONS
The practicality of the above architecture can be checked by running it in parallel with a traditional search engine. A language with multithreading support should be used for the user agents; we suggest Java for its implementation, since the user agent then becomes platform independent and can run on any type of web server. The database may be clustered for better performance and access time.
4. ACKNOWLEDGMENTS
Our sincere thanks to the RCET-Bhilai management for providing the essential infrastructure and support.
5. REFERENCES
[1] Guckian, K., 2011. "Internet 202: The Invisible Web - Beyond Google", http://www.ire.org/resourcecenter/viewtipsheets.php?number=2347.
[2] Brin, S. & Page, L., 1998. "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, Elsevier, Vol. 30(1-7), pp. 107-117.
[3] Dunham, M., 2003. "Data Mining: Introductory and Advanced Topics", 1/e, Pearson Education.
[4] Brandman, O., Cho, J., Garcia-Molina, H. & Shivakumar, N., 2000. "Crawler-friendly web servers", ACM SIGMETRICS Performance Evaluation Review, ACM, Vol. 28(2), pp. 9-14.
[5] Cho, J. & Garcia-Molina, H., 2003. "Estimating frequency of change", ACM Transactions on Internet Technology (TOIT), ACM New York, NY, USA, Vol. 3(3), pp. 256-290.
[6] Boswell, D., 2003. "Distributed high-performance Web crawlers: a survey of the state of the art", Department of Electrical & Computer Engineering, University of California, San Diego, Citeseer.
[7] Wissner-Gross, A., 2009. "Environmental Footprint Monitor For Computer Networks", WO Patent WO/2009/076,667.
[8] Connolly, M., 2009. "What Does A Google Search Really Cost?", Association of Small Computer Users in Education, "Our Second Quarter Century of Resource Sharing", pp. 84.
[9] Cho, J., Garcia-Molina, H. & Page, L., 1998. "Efficient crawling through URL ordering", Computer Networks and ISDN Systems, Elsevier, Vol. 30(1-7), pp. 161-172.
[10] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A. & Raghavan, S., 2001. "Searching the web", ACM Transactions on Internet Technology (TOIT), ACM, Vol. 1(1), pp. 2-43.