Table 1 - A sample paper: main title, reference titles and positive examples

Source publication
Conference Paper
A large number of research, technical, and professional documents are available today in digital formats. Digital libraries are created to facilitate search and retrieval of the information supplied by these documents. These libraries may span an entire area of interest (e.g., computer science) or be limited to documents within a small organization. While...

Context in source publication

Context 1
... accumulate the results (URLs) for each of the titles and the corresponding pages make up a positive example (training) set. Table 1 shows an instance of titles extracted from a paper and the corresponding positive examples. ...
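The construction step described in Context 1 can be sketched roughly as follows. The snippet assumes a generic search_fn callable standing in for whatever Web search engine API is used (the source publication does not tie the step to a specific engine), so it is an illustration rather than the authors' implementation.

import requests

def build_positive_examples(reference_titles, search_fn, max_results=5):
    """Query the search engine with each extracted reference title and
    collect the pages behind the returned URLs as positive training examples.
    search_fn(title, max_results) -> list of URLs  (assumed, engine-specific)."""
    positive_pages = []
    for title in reference_titles:
        for url in search_fn(title, max_results):
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # unreachable page: skip it
            if response.ok:
                positive_pages.append((url, response.text))
    return positive_pages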

Similar publications

Technical Report
Digital library users might not enter a library's webpages through the homepage menus – many use standard search engines, rather than academic databases, to find academic information. Digital library owners should therefore consider not only the menus and/or information architecture as seen from the homepage. The aim of this research project was to...
Conference Paper
This paper presents the BINGO! focused crawler, an advanced tool for information portal generation and expert Web search. In contrast to standard search engines such as Google, which are solely based on precomputed index structures, a focused crawler interleaves crawling, automatic classification, link analysis and assessment, and text filtering...
Conference Paper
In this paper we discuss the architecture of a tool designed to help users develop vertical search engines in different domains and different languages. The design of the tool is presented, and an evaluation study shows that the system is easier to use than other existing tools.

Citations

... Topical crawlers can be used for web mining and harvesting data. A machine learning technique to harvest information from the web and process it using lexical analysis is proposed in [17]. ...
Article
This paper presents the applications of machine learning in the digital library. Using machine learning it is possible to search and retrieve non-textual information. The paper also discusses machine learning applications in security aspects. A systematic review of the literature is also carried out, and a citation network analysis is presented with the help of citation mapping in Web of Science.
... Examples from unrelated papers form a negative example set. Both sets are used to train a Naïve Bayes classifier, which is used to guide the crawling process [8]. ...
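A minimal sketch of that training step, using scikit-learn's multinomial Naïve Bayes as a stand-in (the cited work [8] does not necessarily rely on this library; the class names below are ordinary scikit-learn APIs, not names from the paper):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_relevance_classifier(positive_texts, negative_texts):
    # 1 = page from the positive example set, 0 = page from the negative set.
    texts = list(positive_texts) + list(negative_texts)
    labels = [1] * len(positive_texts) + [0] * len(negative_texts)
    model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
    model.fit(texts, labels)
    return model

# During crawling, the estimated probability of relevance can be used to decide
# whether (or how eagerly) to follow a page's outgoing links:
# relevance = model.predict_proba([page_text])[0][1]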
... The priority scores of the fetcher queue items are cyclically incremented to prevent starvation. HTML data of the analyzed Web page is fully stored in the repository along with all the measurements, such as the priority scores given to the links. ...
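The starvation-avoidance idea quoted above can be illustrated with a small aging priority queue; this is a generic sketch of cyclically boosting waiting items, not the cited crawler's actual fetcher queue.

import heapq
import itertools

class AgingFetcherQueue:
    """Priority queue whose waiting items periodically gain priority,
    so that low-priority URLs are never starved indefinitely."""

    def __init__(self, aging_bonus=1):
        self._heap = []                    # entries: (-priority, tie_breaker, url)
        self._tie_breaker = itertools.count()
        self._aging_bonus = aging_bonus

    def push(self, url, priority):
        heapq.heappush(self._heap, (-priority, next(self._tie_breaker), url))

    def pop(self):
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def age(self):
        # Called cyclically: every item still waiting gets a small priority boost.
        self._heap = [(p - self._aging_bonus, t, u) for (p, t, u) in self._heap]
        heapq.heapify(self._heap)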
... A focused crawler aims to crawl web pages that are considered relevant to a specific user interest. There exist several works concerning focused crawling, proposing heuristics to that end [10], [11] as well as classification schemes [12]. ...
... As previously mentioned, our proposed method receives as input a list of alumni names from an undergraduate program. We perform experiments with five alumni lists, available on the web, concerning the following undergraduate programs: Computer Science of the Federal University of Minas Gerais (UFMG), Metallurgical Engineering of the Federal University of Ouro Preto (UFOP), Chemistry of the University of São Paulo (USP), Computer Science of the USP, and Computer Science of the Pontifical Catholic University of Paraná (PUC-PR). Table II shows the number of alumni available in each list. ...
Conference Paper
An undergraduate program must prepare its students for the major needs of the labor market. One of the main ways to identify the demands to be met is to manage information about the program's alumni. This consists of gathering data from the program's alumni and finding out their main areas of employment in the labor market or their main fields of research in academia. Usually, this data is obtained through forms available on the Web or forwarded by mail or email; however, these methods, in addition to being laborious, do not yield good feedback from the alumni. Thus, this work proposes a novel method to help the teaching staff of undergraduate programs gather information on the desired population of alumni, semi-automatically, on the Web. Overall, by using a few alumni pages as an initial set of sample pages, the proposed method was capable of gathering information concerning about twice as many alumni as conventional methods.
... Another interesting approach that benefits from the best answers returned by a Web search engine is discussed in [16]. Specific information from a set of documents is used to query a Web search engine. ...
Conference Paper
The discovery of web documents about certain topics is an important task for web-based applications including web document retrieval, opinion mining and knowledge extraction. In this paper, we propose an agent-based focused crawling framework able to retrieve topic- and genre-related web documents. Starting from a simple topic query, a set of focused crawler agents explore in parallel topic-specific web paths using dynamic seed URLs that belong to certain web genres and are collected from web search engines. The agents make use of an internal mechanism that weighs topic and genre relevance scores of unvisited web pages. They are able to adapt to the properties of a given topic by modifying their internal knowledge during search, handle ambiguous queries, ignore irrelevant pages with respect to the topic and retrieve collaboratively topic-relevant web pages. We performed an experimental study to evaluate the behavior of the agents for a variety of topic queries demonstrating the benefits and the capabilities of our framework.
... It involves automatic classification of visited pages into a user or community-specific topic hierarchy. They are becoming important tools to support applications such as specialized Web portals [10], digital libraries [11], online searching [9] and competitive intelligence [12]. Figure 1 details, according to [8], the high-level design of the infrastructure for focused crawlers. ...
Conference Paper
Focused crawlers attempt to crawl web pages that are relevant to a specific topic or user interest. Although these kinds of crawlers have been proven to be effective, they need to improve their efficiency. Focused crawlers usually use a Frontier of non-visited URLs to visit web pages and gather relevant ones. In this work, we define and evaluate a queueing policy for non-visited URLs, based on link context, to improve the efficiency of a genre-aware focused crawler. Our experimental evaluation shows, in some situations, an improvement of around 100% in terms of efficiency.
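The general shape of such a link-context queueing policy is sketched below; the overlap-based score is only an illustration of ordering the frontier by link context, not the exact policy evaluated in the paper.

import heapq

def link_context_score(context_text, topic_terms):
    # Fraction of topic terms that occur in the text surrounding the link.
    words = set(context_text.lower().split())
    return sum(1 for term in topic_terms if term in words) / max(len(topic_terms), 1)

class ContextFrontier:
    """Frontier of non-visited URLs, dequeued in order of link-context relevance."""

    def __init__(self, topic_terms):
        self._topic_terms = [t.lower() for t in topic_terms]
        self._heap = []

    def enqueue(self, url, link_context):
        score = link_context_score(link_context, self._topic_terms)
        heapq.heappush(self._heap, (-score, url))

    def dequeue(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score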
... To apply machine learning (ML) to one of the standard DL circulation activities, namely text categorization [48], is part of the cognitive toolbox deployed [18]. In this context, ML is extensively being experimented with in different development areas and scenarios; to name but a few, for extracting image content from figures in scientific documents for categorization [33, 34], automatically assessing and characterizing resource quality for educational DL [54, 5], assessing the quality of scientific conferences [37], web-based collection development [42], automated document metadata extraction by support vector machines (SVM, [24]), automatic extraction of titles from general documents [27], information architecture [17], to remove duplicate documents [9], for collaborative filtering [59], for the automatic expansion of domain-specific lexicons by term categorization [3], for generating visual thesauri [45], or the semantic markup of documents [13]. As part of this direction of research, ML is being tested for its ability to reproduce parts of collections indexed by widespread classification schemes in a supervised learning setting, such as automatic text categorization using the Dewey Decimal Classification (DDC, [52]), or the Library of Congress Classification (LCC) from Library of Congress Subject Headings (LCSH, [20, 43]). ...
Article
Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by means of standard test collections. In this article, we present a pilot experiment of replacing such test collections by a set of 6,000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts and vice versa from full-text documents, the latter outcome being due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of the persistent archives and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction, developing further the methodological approach and demonstrating that text categorization can be applied to analyse the thematic coverage in digital repositories. Keywords: Digital libraries, Text categorization, Machine learning, Support vector machines, Analogical information representation, Wavelet analysis
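For orientation, the supervised setting described here can be approximated with a plain scikit-learn baseline; the snippet uses an off-the-shelf TF-IDF representation and RBF kernel rather than the Hilbert-space document functions, wavelet-based kernels, and added lexical semantics proposed in the article.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def lcsh_reconstruction_baseline(texts, subject_labels):
    """Train an SVM to reproduce existing subject headings and report
    cross-validated macro-F1 as a rough measure of reconstruction quality."""
    model = make_pipeline(TfidfVectorizer(stop_words="english"), SVC(kernel="rbf"))
    scores = cross_val_score(model, texts, subject_labels, cv=5, scoring="f1_macro")
    model.fit(texts, subject_labels)
    return model, scores.mean()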
... Their performance depends highly on the selection of good starting pages (seed pages). Typically users provide a set of seed pages as input to a crawler or, alternatively, seed pages are selected among the best answers returned by a Web search engine [32], using the topic as query [15] [16] [31]. Good seed pages can be either pages relevant to the topic or pages from which relevant pages can be accessed within a small number of routing hops. ...
Article
This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths leading to relevant pages. A novel learning crawler inspired by a previously proposed Hidden Markov Model (HMM) crawler is described as well. The crawlers have been implemented using the same baseline implementation (only the priority assignment function differs in each crawler) providing an unbiased evaluation framework for a comparative analysis of their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used for assigning download priorities to web pages. Furthermore, the new HMM crawler improved the performance of the original HMM crawler and also outperforms classic focused crawlers in searching for specialized topics.
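As a rough illustration of assigning download priorities from a combination of web page content and anchor text, the function below blends two cosine similarities to a topic description; the weighting is arbitrary, and the snippet is not the priority function of any specific crawler variant in the article.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def download_priority(topic_description, page_text, anchor_text, content_weight=0.5):
    """Priority in [0, 1]: weighted mix of content and anchor-text similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([topic_description, page_text, anchor_text])
    content_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    anchor_sim = cosine_similarity(tfidf[0], tfidf[2])[0, 0]
    return content_weight * content_sim + (1 - content_weight) * anchor_sim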
... They have as their main goal to efficiently crawl pages that are, in the best possible way, relevant to a specific topic or user interest. Focused crawlers are important for a great variety of applications, such as digital libraries [30], Web resource discovery [9], competitive intelligence [27], and large Web directories [23], to name a few. Additionally, when compared with traditional crawlers used by general purpose search engines, they reduce the use of resources and favor the scalability of the crawling process, since they avoid the need for covering the entire Web. ...
... In general, focused crawlers guided by classifiers [1, 7, 12, 18, 28, 29, 33, 34] present an additional cost of having to train the classifiers with positive and negative examples of pages to be crawled and, due to the generality of the situations in which they are applied, usually reach recall and precision levels between 40% and 70%. This scenario does not change much with other kinds of focused crawler [6, 16, 22, 24, 25, 27, 30]. Moreover, as we can see from the above discussion, previous work on focused crawling relies on a single concept space (i.e., the topic of the pages) for driving the crawling process. ...
Article
Focused crawlers have as their main goal to crawl pages that are relevant to a specific topic or user interest, playing an important role for a great variety of applications. In general, they work by trying to find and crawl all kinds of pages deemed as related to an implicitly declared topic. However, users are often not simply interested in any document about a topic, but instead they may want only documents of a given type or genre on that topic to be retrieved. In this paper, we describe an approach to focused crawling that exploits not only content-related information but also genre information present in Web pages to guide the crawling process. This approach has been designed to address situations in which the specific topic of interest can be expressed by specifying two sets of terms, the first describing genre aspects of the desired pages and the second related to the subject or content of these pages, thus requiring no training or any kind of preprocessing. The effectiveness, efficiency and scalability of the proposed approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi of computer science courses, job offers in the computer science field and sale offers of computer equipment. These experiments show that focused crawlers constructed according to our genre-aware approach achieve F1 levels above 88%, requiring the analysis of no more than 65% of the visited pages in order to find 90% of the relevant pages. In addition, we experimentally analyze the impact of term selection on our approach and evaluate a proposed strategy for semi-automatic generation of such terms. This analysis shows that a small set of terms selected by an expert or a set of terms specified by a typical user familiar with the topic is usually enough to produce good results and that such a semi-automatic strategy is very effective in supporting the task of selecting the sets of terms required to guide a crawling process.
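A simplified reading of the two-term-set idea (one set for genre, one for content), requiring evidence of both aspects before a page is scored as relevant; the term lists below are hypothetical examples, and the actual scoring used in the paper may differ.

def genre_aware_score(page_text, genre_terms, content_terms):
    """Score a page only if it matches both the genre terms and the content terms."""
    words = set(page_text.lower().split())
    genre_hits = sum(1 for t in genre_terms if t.lower() in words)
    content_hits = sum(1 for t in content_terms if t.lower() in words)
    if genre_hits == 0 or content_hits == 0:
        return 0.0  # a page must exhibit both aspects to be considered relevant
    return (genre_hits / len(genre_terms)) * (content_hits / len(content_terms))

# Hypothetical term sets for crawling computer science course syllabi:
genre_terms = ["syllabus", "course", "lecture", "assignment", "grading"]
content_terms = ["algorithms", "programming", "databases", "networks"]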
... [1]), we cannot be sure that quality real-world resources (e.g., athletes, movies, universities) have highly ranked web pages. This has implications for digital libraries and other systems that build collections by using only search engine APIs [11] or use the APIs to augment focused crawling techniques [17] [23] [9]. Our findings also suggest there is future work in determining what additional factors of quality are missed by conventional hyperlink-derived metrics such as PageRank [5] and its many variations. ...
Article
In previous research it has been shown that link-based web page metrics can be used to predict experts' assessment of quality. We are interested in a related question: do expert rankings of real-world entities correlate with search engine rankings of corresponding web resources? For example, each year US News & World Report publishes a list of (among others) top 50 graduate business schools. Does their expert ranking correlate with the search engine ranking of the URLs of those business schools? To answer this question we conducted 9 experiments using 8 expert rankings on a range of academic, athletic, financial and popular culture topics. We compared the expert rankings with the rankings in Google, Live Search (formerly MSN) and Yahoo (with list lengths of 10, 25, and 50). In 57 search engine vs. expert comparisons, only 1 strong and 4 moderate correlations were statistically significant. In 42 inter-search engine comparisons, only 2 strong and 4 moderate correlations were statistically significant. The correlations appeared to decrease with the size of the lists: the 3 strong correlations were for lists of 10, the 8 moderate correlations were for lists of 25, and no correlations were found for lists of 50.
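The kind of comparison carried out in these experiments can be reproduced in a few lines with a rank correlation coefficient; Spearman's rho is used below purely as an example statistic, and the entity lists are assumed to name the same resources in both rankings.

from scipy.stats import spearmanr

def ranking_correlation(expert_order, engine_order):
    """Correlation between an expert ranking and a search engine ranking,
    both given as lists of entity names ordered from best to worst."""
    common = [e for e in expert_order if e in engine_order]
    expert_ranks = [expert_order.index(e) for e in common]
    engine_ranks = [engine_order.index(e) for e in common]
    rho, p_value = spearmanr(expert_ranks, engine_ranks)
    return rho, p_value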
... Focused crawling serves as a popular means for pulling data from external sources. Such crawlers can be used to build digital libraries from scratch [2] or to enhance existing collections in a digital library [6] [17] [27]. ...
... In general, the set of topics is determined primarily by domain experts who perform manual analysis, assisted by semi-automatic tools. To assist with such a complex manual process, several methods have been proposed [27] [17] [5] [16]. In [27], a knowledge discovery task is initiated for each document that was referenced in the collection but has only the metadata available and no actual content. ...
... That ranking can then be used to assign crawling priorities over the missing documents. Other solutions [17] [5] use a classifier that guides a focused crawler to discover relevant documents in external sources. For this purpose, a classifier is trained from a training sample of marked documents. ...
Conference Paper
This work shows how the content of a digital library can be enhanced to better satisfy its users' needs. Missing content is identified by finding missing content topics in the system's query log or in a pre-defined taxonomy of required knowledge. The collection is then enhanced with new relevant knowledge, which is extracted from external sources that satisfy those missing content topics. Experiments we conducted measure the precision of the system before and after content enhancement. The results demonstrate a significant improvement in the system effectiveness as a result of content enhancement and the superiority of the missing content enhancement policy over several other possible policies.
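One of the two routes mentioned above, mining the query log for missing content topics, can be sketched as follows; collection_search is a hypothetical helper standing in for the digital library's own search facility, and the threshold is illustrative rather than taken from the paper.

def missing_content_topics(query_log, collection_search, min_results=1):
    """Return query-log topics for which the current collection has too few
    documents; collection_search(query) -> list of matching document ids."""
    shortfall = {}
    for query in query_log:
        hits = collection_search(query)
        if len(hits) < min_results:
            shortfall[query] = len(hits)
    # Topics with the fewest hits are the strongest candidates for enhancement
    # from external sources.
    return sorted(shortfall, key=shortfall.get)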