Table 1 - A sample paper: main title, reference titles and positive examples

Source publication
Conference Paper
A large number of research, technical, and professional documents are available today in digital formats. Digital libraries are created to facilitate search and retrieval of the information supplied by these documents. These libraries may span an entire area of interest (e.g., computer science) or be limited to documents within a small organization. While...

Context in source publication

Context 1
... accumulate the results (URLs) for each of the titles and the corresponding pages make up a positive example (training) set. Table 1 shows an instance of titles extracted from a paper and the corresponding positive examples. ...
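The construction step described in Context 1 can be sketched roughly as follows. The snippet assumes a generic search_fn callable standing in for whatever Web search engine API is used (the source publication does not tie the step to a specific engine), so it is an illustration rather than the authors' implementation.

import requests

def build_positive_examples(reference_titles, search_fn, max_results=5):
    """Query the search engine with each extracted reference title and
    collect the pages behind the returned URLs as positive training examples.
    search_fn(title, max_results) -> list of URLs  (assumed, engine-specific)."""
    positive_pages = []
    for title in reference_titles:
        for url in search_fn(title, max_results):
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # unreachable page: skip it
            if response.ok:
                positive_pages.append((url, response.text))
    return positive_pages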

Similar publications

Technical Report
Digital library users might not enter a library's webpages through the homepage menus – many use standard search engines, rather than academic databases, to find academic information. Digital library owners should therefore consider not only the menus and/or information architecture as seen from the homepage. The aim of this research project was to...
Conference Paper
This paper presents the BINGO! focused crawler, an advanced tool for information portal generation and expert Web search. In contrast to standard search engines such as Google, which are solely based on precomputed index structures, a focused crawler interleaves crawling, automatic classification, link analysis and assessment, and text filtering...
Conference Paper
In this paper we discuss the architecture of a tool designed to help users develop vertical search engines in different domains and different languages. The design of the tool is presented, and an evaluation study shows that the system is easier to use than other existing tools.

Citations

... Topical crawlers can be used for web mining and harvesting data. A machine learning technique to harvest information from the web and process it using lexical analysis is proposed in [17]. ...
Article
This paper presents the applications of machine learning in the digital library. Using machine learning it is possible to search and retrieve non-textual information. The paper also discusses machine learning applications in security aspects. A systematic review of the literature is also carried out, and a citation network analysis is presented with the help of citation mapping in Web of Science.
... Examples from unrelated papers form a negative example set. Both sets are used to train a Naïve Bayes classifier, which is used to guide the crawling process [8]. ...
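A minimal sketch of that training step, using scikit-learn's multinomial Naïve Bayes as a stand-in (the cited work [8] does not necessarily rely on this library; the class names below are ordinary scikit-learn APIs, not names from the paper):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_relevance_classifier(positive_texts, negative_texts):
    # 1 = page from the positive example set, 0 = page from the negative set.
    texts = list(positive_texts) + list(negative_texts)
    labels = [1] * len(positive_texts) + [0] * len(negative_texts)
    model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
    model.fit(texts, labels)
    return model

# During crawling, the estimated probability of relevance can be used to decide
# whether (or how eagerly) to follow a page's outgoing links:
# relevance = model.predict_proba([page_text])[0][1]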
... The priority scores of the fetcher queue items are cyclically incremented to prevent starvation. HTML data of the analyzed Web page is fully stored in the repository along with all the measurements, such as the priority scores given to the links. ...
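The starvation-avoidance idea quoted above can be illustrated with a small aging priority queue; this is a generic sketch of cyclically boosting waiting items, not the cited crawler's actual fetcher queue.

import heapq
import itertools

class AgingFetcherQueue:
    """Priority queue whose waiting items periodically gain priority,
    so that low-priority URLs are never starved indefinitely."""

    def __init__(self, aging_bonus=1):
        self._heap = []                    # entries: (-priority, tie_breaker, url)
        self._tie_breaker = itertools.count()
        self._aging_bonus = aging_bonus

    def push(self, url, priority):
        heapq.heappush(self._heap, (-priority, next(self._tie_breaker), url))

    def pop(self):
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def age(self):
        # Called cyclically: every item still waiting gets a small priority boost.
        self._heap = [(p - self._aging_bonus, t, u) for (p, t, u) in self._heap]
        heapq.heapify(self._heap)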
... A focused crawler aims to crawl web pages that are considered relevant to a specific user interest. There exist several works concerning focused crawling, proposing heuristics to that end [10], [11] as well as classification schemes [12]. ...
... As previously mentioned, our proposed method receives as input a list of alumni names from an undergraduate program. We perform experiments with five alumni lists, available on the web, concerning the following undergraduate programs: Computer Science of the Federal University of Minas Gerais (UFMG), Metallurgical Engineering of the Federal University of Ouro Preto (UFOP), Chemistry of the University of São Paulo (USP), Computer Science of the USP, and Computer Science of the Pontifical Catholic University of Paraná (PUC-PR). Table II shows the number of alumni available in each list. ...
Conference Paper
An undergraduate program must prepare its students for the major needs of the labor market. One of the main ways to identify the demands to be met is to manage information about the program's alumni. This consists of gathering data from the program's alumni and finding out their main areas of employment in the labor market or their main fields of research in academia. Usually, this data is obtained through forms available on the Web or forwarded by mail or email; however, these methods, in addition to being laborious, do not yield good feedback from the alumni. Thus, this work proposes a novel method to help the teaching staff of undergraduate programs gather information on the desired population of alumni, semi-automatically, on the Web. Overall, by using a few alumni pages as an initial set of sample pages, the proposed method was capable of gathering information concerning about twice as many alumni as conventional methods.
... Another interesting approach that benefits from the best answers returned by a Web search engine is discussed in [16]. Specific information from a set of documents is used to query a Web search engine. ...
Conference Paper
The discovery of web documents about certain topics is an important task for web-based applications including web document retrieval, opinion mining and knowledge extraction. In this paper, we propose an agent-based focused crawling framework able to retrieve topic- and genre-related web documents. Starting from a simple topic query, a set of focused crawler agents explore in parallel topic-specific web paths using dynamic seed URLs that belong to certain web genres and are collected from web search engines. The agents make use of an internal mechanism that weighs topic and genre relevance scores of unvisited web pages. They are able to adapt to the properties of a given topic by modifying their internal knowledge during search, handle ambiguous queries, ignore irrelevant pages with respect to the topic and retrieve collaboratively topic-relevant web pages. We performed an experimental study to evaluate the behavior of the agents for a variety of topic queries demonstrating the benefits and the capabilities of our framework.
... It involves automatic classification of visited pages into a user or community-specific topic hierarchy. They are becoming important tools to support applications such as specialized Web portals [10], digital libraries [11], online searching [9] and competitive intelligence [12]. Figure 1 details, according to [8], the high-level design of the infrastructure for focused crawlers. ...
Conference Paper
Focused crawlers attempt to crawl web pages that are relevant to a specific topic or user interest. Although these kinds of crawlers have been proven to be effective, they need to improve their efficiency. Focused crawlers usually use a Frontier of non-visited URLs to visit web pages and gather relevant ones. In this work, we define and evaluate a queueing policy for non-visited URLs, based on link context, to improve the efficiency of a genre-aware focused crawler. Our experimental evaluation shows, in some situations, an improvement of around 100% in terms of efficiency.
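The general shape of such a link-context queueing policy is sketched below; the overlap-based score is only an illustration of ordering the frontier by link context, not the exact policy evaluated in the paper.

import heapq

def link_context_score(context_text, topic_terms):
    # Fraction of topic terms that occur in the text surrounding the link.
    words = set(context_text.lower().split())
    return sum(1 for term in topic_terms if term in words) / max(len(topic_terms), 1)

class ContextFrontier:
    """Frontier of non-visited URLs, dequeued in order of link-context relevance."""

    def __init__(self, topic_terms):
        self._topic_terms = [t.lower() for t in topic_terms]
        self._heap = []

    def enqueue(self, url, link_context):
        score = link_context_score(link_context, self._topic_terms)
        heapq.heappush(self._heap, (-score, url))

    def dequeue(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score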
... To apply machine learning (ML) to one of the standard DL circulation activities, namely text categorization [48], is part of the cognitive toolbox deployed [18]. In this context, ML is extensively being experimented with in different development areas and scenarios; to name but a few, for extracting image content from figures in scientific documents for categorization [33, 34], automatically assessing and characterizing resource quality for educational DL [54, 5], assessing the quality of scientific conferences [37], web-based collection development [42], automated document metadata extraction by support vector machines (SVM, [24]), automatic extraction of titles from general documents [27], information architecture [17], to remove duplicate documents [9], for collaborative filtering [59], for the automatic expansion of domain-specific lexicons by term categorization [3], for generating visual thesauri [45], or the semantic markup of documents [13]. As part of this direction of research, ML is being tested for its ability to reproduce parts of collections indexed by widespread classification schemes in a supervised learning setting, such as automatic text categorization using the Dewey Decimal Classification (DDC, [52]), or the Library of Congress Classification (LCC) from Library of Congress Subject Headings (LCSH, [20, 43]). ...
Article
Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by means of standard test collections. In this article, we present a pilot experiment of replacing such test collections by a set of 6,000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts and vice versa from full-text documents, the latter outcome being due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of the persistent archives and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction, developing further the methodological approach and demonstrating that text categorization can be applied to analyse the thematic coverage in digital repositories. Keywords: Digital libraries, Text categorization, Machine learning, Support vector machines, Analogical information representation, Wavelet analysis
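For orientation, the supervised setting described here can be approximated with a plain scikit-learn baseline; the snippet uses an off-the-shelf TF-IDF representation and RBF kernel rather than the Hilbert-space document functions, wavelet-based kernels, and added lexical semantics proposed in the article.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def lcsh_reconstruction_baseline(texts, subject_labels):
    """Train an SVM to reproduce existing subject headings and report
    cross-validated macro-F1 as a rough measure of reconstruction quality."""
    model = make_pipeline(TfidfVectorizer(stop_words="english"), SVC(kernel="rbf"))
    scores = cross_val_score(model, texts, subject_labels, cv=5, scoring="f1_macro")
    model.fit(texts, subject_labels)
    return model, scores.mean()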
... Their performance depends highly on the selection of good starting pages (seed pages). Typically users provide a set of seed pages as input to a crawler or, alternatively, seed pages are selected among the best answers returned by a Web search engine [32], using the topic as query [15] [16] [31]. Good seed pages can be either pages relevant to the topic or pages from which relevant pages can be accessed within a small number of routing hops. ...
Article
This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths leading to relevant pages. A novel learning crawler inspired by a previously proposed Hidden Markov Model (HMM) crawler is described as well. The crawlers have been implemented using the same baseline implementation (only the priority assignment function differs in each crawler) providing an unbiased evaluation framework for a comparative analysis of their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used for assigning download priorities to web pages. Furthermore, the new HMM crawler improved the performance of the original HMM crawler and also outperforms classic focused crawlers in searching for specialized topics.
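As a rough illustration of assigning download priorities from a combination of web page content and anchor text, the function below blends two cosine similarities to a topic description; the weighting is arbitrary, and the snippet is not the priority function of any specific crawler variant in the article.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def download_priority(topic_description, page_text, anchor_text, content_weight=0.5):
    """Priority in [0, 1]: weighted mix of content and anchor-text similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([topic_description, page_text, anchor_text])
    content_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    anchor_sim = cosine_similarity(tfidf[0], tfidf[2])[0, 0]
    return content_weight * content_sim + (1 - content_weight) * anchor_sim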
... They have as their main goal to efficiently crawl pages that are, in the best possible way, relevant to a specific topic or user interest. Focused crawlers are important for a great variety of applications, such as digital libraries [30], Web resource discovery [9], competitive intelligence [27], and large Web directories [23], to name a few. Additionally, when compared with traditional crawlers used by general purpose search engines, they reduce the use of resources and favor the scalability of the crawling process, since they avoid the need for covering the entire Web. ...
... In general, focused crawlers guided by classifiers [1, 7, 12, 18, 28, 29, 33, 34] present an additional cost of having to train the classifiers with positive and negative examples of pages to be crawled and, due to the generality of the situations in which they are applied, usually reach recall and precision levels between 40% and 70%. This scenario does not change much with other kinds of focused crawler [6, 16, 22, 24, 25, 27, 30]. Moreover, as we can see from the above discussion, previous work on focused crawling relies on a single concept space (i.e., the topic of the pages) for driving the crawling process. ...
Article
Focused crawlers have as their main goal to crawl pages that are relevant to a specific topic or user interest, playing an important role for a great variety of applications. In general, they work by trying to find and crawl all kinds of pages deemed as related to an implicitly declared topic. However, users are often not simply interested in any document about a topic, but instead they may want only documents of a given type or genre on that topic to be retrieved. In this paper, we describe an approach to focused crawling that exploits not only content-related information but also genre information present in Web pages to guide the crawling process. This approach has been designed to address situations in which the specific topic of interest can be expressed by specifying two sets of terms, the first describing genre aspects of the desired pages and the second related to the subject or content of these pages, thus requiring no training or any kind of preprocessing. The effectiveness, efficiency and scalability of the proposed approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi of computer science courses, job offers in the computer science field and sale offers of computer equipment. These experiments show that focused crawlers constructed according to our genre-aware approach achieve F1 levels above 88%, requiring the analysis of no more than 65% of the visited pages in order to find 90% of the relevant pages. In addition, we experimentally analyze the impact of term selection on our approach and evaluate a proposed strategy for semi-automatic generation of such terms. This analysis shows that a small set of terms selected by an expert or a set of terms specified by a typical user familiar with the topic is usually enough to produce good results and that such a semi-automatic strategy is very effective in supporting the task of selecting the sets of terms required to guide a crawling process.
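A simplified reading of the two-term-set idea (one set for genre, one for content), requiring evidence of both aspects before a page is scored as relevant; the term lists below are hypothetical examples, and the actual scoring used in the paper may differ.

def genre_aware_score(page_text, genre_terms, content_terms):
    """Score a page only if it matches both the genre terms and the content terms."""
    words = set(page_text.lower().split())
    genre_hits = sum(1 for t in genre_terms if t.lower() in words)
    content_hits = sum(1 for t in content_terms if t.lower() in words)
    if genre_hits == 0 or content_hits == 0:
        return 0.0  # a page must exhibit both aspects to be considered relevant
    return (genre_hits / len(genre_terms)) * (content_hits / len(content_terms))

# Hypothetical term sets for crawling computer science course syllabi:
genre_terms = ["syllabus", "course", "lecture", "assignment", "grading"]
content_terms = ["algorithms", "programming", "databases", "networks"]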
... [1]), we cannot be sure that quality real-world resources (e.g., athletes, movies, universities) have highly ranked web pages. This has implications for digital libraries and other systems that build collections by using only search engine APIs [11] or use the APIs to augment focused crawling techniques [17] [23] [9]. Our findings also suggest there is future work in determining what additional factors of quality are missed by conventional hyperlink-derived metrics such as PageRank [5] and its many variations. ...
Article
In previous research it has been shown that link-based web page metrics can be used to predict experts' assessment of quality. We are interested in a related question: do expert rankings of real-world entities correlate with search engine rankings of corresponding web resources? For example, each year US News & World Report publishes a list of (among others) top 50 graduate business schools. Does their expert ranking correlate with the search engine ranking of the URLs of those business schools? To answer this question we conducted 9 experiments using 8 expert rankings on a range of academic, athletic, financial and popular culture topics. We compared the expert rankings with the rankings in Google, Live Search (formerly MSN) and Yahoo (with list lengths of 10, 25, and 50). In 57 search engine vs. expert comparisons, only 1 strong and 4 moderate correlations were statistically significant. In 42 inter-search engine comparisons, only 2 strong and 4 moderate correlations were statistically significant. The correlations appeared to decrease with the size of the lists: the 3 strong correlations were for lists of 10, the 8 moderate correlations were for lists of 25, and no correlations were found for lists of 50.
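The kind of comparison carried out in these experiments can be reproduced in a few lines with a rank correlation coefficient; Spearman's rho is used below purely as an example statistic, and the entity lists are assumed to name the same resources in both rankings.

from scipy.stats import spearmanr

def ranking_correlation(expert_order, engine_order):
    """Correlation between an expert ranking and a search engine ranking,
    both given as lists of entity names ordered from best to worst."""
    common = [e for e in expert_order if e in engine_order]
    expert_ranks = [expert_order.index(e) for e in common]
    engine_ranks = [engine_order.index(e) for e in common]
    rho, p_value = spearmanr(expert_ranks, engine_ranks)
    return rho, p_value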
... Focused crawling serves as a popular means for pulling data from external sources. Such crawlers can be used to build digital libraries from scratch [2] or to enhance existing collections in a digital library [6] [17] [27]. ...
... In general, the set of topics is determined primarily by domain experts who perform manual analysis, assisted by semi-automatic tools. To assist with such a complex manual process, several methods have been proposed [27] [17] [5] [16]. In [27], a knowledge discovery task is initiated for each document that was referenced in the collection but has only the metadata available and no actual content. ...
... That ranking can then be used to assign crawling priorities over the missing documents. Other solutions [17] [5] use a classifier that guides a focused crawler to discover relevant documents in external sources. For this purpose, a classifier is trained from a training sample of marked documents. ...
Conference Paper
This work shows how the content of a digital library can be enhanced to better satisfy its users' needs. Missing content is identified by finding missing content topics in the system's query log or in a pre-defined taxonomy of required knowledge. The collection is then enhanced with new relevant knowledge, which is extracted from external sources that satisfy those missing content topics. Experiments we conducted measure the precision of the system before and after content enhancement. The results demonstrate a significant improvement in the system effectiveness as a result of content enhancement and the superiority of the missing content enhancement policy over several other possible policies.
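One of the two routes mentioned above, mining the query log for missing content topics, can be sketched as follows; collection_search is a hypothetical helper standing in for the digital library's own search facility, and the threshold is illustrative rather than taken from the paper.

def missing_content_topics(query_log, collection_search, min_results=1):
    """Return query-log topics for which the current collection has too few
    documents; collection_search(query) -> list of matching document ids."""
    shortfall = {}
    for query in query_log:
        hits = collection_search(query)
        if len(hits) < min_results:
            shortfall[query] = len(hits)
    # Topics with the fewest hits are the strongest candidates for enhancement
    # from external sources.
    return sorted(shortfall, key=shortfall.get)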