Figure 1 - uploaded by Sonali Gupta
Source publication
With the ever-proliferating size and scale of the WWW [1], efficient ways of
exploring content are of increasing importance. How can we efficiently retrieve
information from it through crawling? In this era of tera and multi-core
processors, we ought to consider multi-threaded processes as a serving
solution. So, even better, how can we improve t...
Citations
With the rapid development of the Internet, the web crawler has become one of the key technologies for users to automatically obtain information from designated sites. Traditional web crawler technology has exposed several problems, such as low content accuracy due to simple filtering conditions with respect to crawling themes, and low efficiency due to content duplication and long webpage update times. To solve these problems, we propose the TBPR (Theme weight and Bayesian Page Rank based crawler) approach, which adopts a multi-queue model to achieve high efficiency and reduce content redundancy. Further, TBPR introduces a theme weights model to accurately classify web pages into the user's crawl concept and a Bayesian Page Rank model containing two novel factors to increase content accuracy. Our experiment applies TBPR to real-world web content, demonstrating its accuracy and efficiency.
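The abstract above names a multi-queue model and a theme weighting step but gives no details of either. As a rough illustration of the general idea only, the following minimal sketch shows a crawl frontier with two queues and a seen-set for duplicate suppression. It is not the TBPR method: the class `MultiQueueFrontier`, the function `theme_score`, the keyword-overlap scoring, and the `threshold` parameter are all hypothetical stand-ins, and the Bayesian Page Rank component is not modeled at all.

```python
import heapq
from collections import deque


def theme_score(text, theme_terms):
    """Hypothetical relevance score: fraction of theme terms found in the text."""
    if not theme_terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for term in theme_terms if term in words) / len(theme_terms)


class MultiQueueFrontier:
    """Illustrative two-queue crawl frontier (an assumption, not TBPR itself)."""

    def __init__(self, theme_terms, threshold=0.5):
        self.theme_terms = [t.lower() for t in theme_terms]
        self.threshold = threshold  # assumed cut-off between the two queues
        self.high = []              # max-heap via negated score: on-theme URLs
        self.low = deque()          # FIFO queue: off-theme URLs
        self.seen = set()           # every URL ever enqueued, to avoid duplicates
        self._order = 0             # tie-breaker so equal scores stay FIFO

    def add(self, url, anchor_text):
        """Enqueue a URL once; return False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        score = theme_score(anchor_text, self.theme_terms)
        if score >= self.threshold:
            heapq.heappush(self.high, (-score, self._order, url))
            self._order += 1
        else:
            self.low.append(url)
        return True

    def pop(self):
        """Return the next URL to fetch, preferring the on-theme queue."""
        if self.high:
            return heapq.heappop(self.high)[2]
        if self.low:
            return self.low.popleft()
        return None
```

In this sketch a duplicate URL is rejected at enqueue time, and off-theme links are only fetched once the on-theme queue is empty. A real crawler would also normalize URLs before the duplicate check and would score the fetched page content rather than only the anchor text.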