Figure 1 - uploaded by Sonali Gupta
Content may be subject to copyright.
General Search Engine  

General Search Engine  

Source publication
Article
Full-text available
With the ever proliferating size and scale of the WWW [1] efficient ways of exploring content are of increasing importance. How can we efficiently retrieve information from it through crawling? And in this era of tera and multi-core processors, we ought to think of multi-threaded processes as a serving solution. So, even better how can we improve t...

Citations

Conference Paper
With the rapid development of Internet, the web crawler has become one of the key technologies for users to automatically obtain information from designated sites. The traditional web crawler technology has exposed several problems, such as low content accuracy due to simple filtering conditions with respect to crawling themes, low efficiency due to content duplication and long webpage update time. Aiming at solving these problems, we propose the TBPR (Theme weight and Bayesian Page Rank based crawler) approach by adopting a multi-queue model to achieve high efficiency and reduce content redundancy. Further, TBPR introduces a theme weights model to accurately classify web pages into user’s crawl concept and a Bayesian Page Rank model containing two novel factors to increase content accuracy. Our experiment applies TBPR to real world web contents, demonstrating its accuracy and efficiency.