General Search Engine

Source publication

WebParF: A Web Partitioning Framework for Parallel Crawler

Article

Jun 2014

With the ever-proliferating size and scale of the WWW [1], efficient ways of exploring content are of increasing importance. How can we efficiently retrieve information from it through crawling? In this era of tera- and multi-core processors, we ought to think of multi-threaded processes as a serving solution. So, even better, how can we improve t...

Crawling and cluster hidden web using crawler framework and fuzzy-KNN

Conference Paper

Aug 2017

Towards Intelligent Web Crawling – A Theme Weight and Bayesian Page Rank Based Approach

Conference Paper

Oct 2017
Lect Notes Comput Sci

With the rapid development of the Internet, the web crawler has become one of the key technologies for users to automatically obtain information from designated sites. Traditional web crawler technology has exposed several problems, such as low content accuracy due to simple filtering conditions with respect to crawling themes, and low efficiency due to content duplication and long webpage update times. To solve these problems, we propose the TBPR (Theme weight and Bayesian Page Rank based crawler) approach, which adopts a multi-queue model to achieve high efficiency and reduce content redundancy. Further, TBPR introduces a theme weights model to accurately classify web pages into the user's crawl concept and a Bayesian Page Rank model containing two novel factors to increase content accuracy. Our experiment applies TBPR to real-world web content, demonstrating its accuracy and efficiency.

Citations