Conference Paper · PDF Available

ECON: An Approach to Extract Content from Web News Page

Abstract

This paper presents a simple but effective approach, named ECON, to fully automatically extract content from Web news pages. ECON represents a Web news page as a DOM tree and leverages substantial features of that tree. ECON first finds a snippet-node that wraps part of the news content, then backtracks from the snippet-node until it reaches a summary-node that wraps the entire news content, removing noise during the backtracking process. Experimental results showed that ECON achieves high accuracy and fully satisfies the requirements for scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic, and it can be implemented very easily.
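The snippet-node/summary-node flow described in the abstract can be illustrated with a short, hypothetical sketch (this is not the authors' reference implementation; the punctuation test, the climbing threshold, and the link-density cutoff are assumptions made for illustration):

from lxml import html

# Latin and CJK punctuation marks used as a rough signal of article prose.
PUNCT = set(",.;:!?\u3002\uff0c\uff1b\uff1a\uff01\uff1f")

def punct_count(text):
    return sum(1 for ch in (text or "") if ch in PUNCT)

def extract_article(page_source):
    tree = html.fromstring(page_source)
    # 1. Snippet-node: the element whose own text carries the most punctuation,
    #    taken as a node that wraps part of the news content.
    snippet = max(tree.iter("*"), key=lambda n: punct_count(n.text))
    # 2. Backtrack from the snippet-node toward the root; keep climbing while the
    #    parent adds article-like punctuation, and stop at the summary-node once
    #    ancestors only contribute low-punctuation noise (menus, ads, footers).
    node = snippet
    while node.getparent() is not None:
        parent = node.getparent()
        gain = punct_count(parent.text_content()) - punct_count(node.text_content())
        if gain > 2:  # assumed threshold
            node = parent
        else:
            break
    # 3. Crude noise removal inside the summary-node: drop children dominated by link text.
    for child in list(node):
        text = child.text_content().strip()
        link_text = "".join(a.text_content() for a in child.findall(".//a"))
        if text and len(link_text) > 0.8 * len(text):
            node.remove(child)
    return node.text_content().strip()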
... The process of eliminating the noise present in web pages is referred to as web content outlier mining [5][6]. Also, identifying the valuable information in a document that contains non-informative content is a most significant task, as it helps users access web pages on handheld devices such as smartphones and PDAs [7][8]. ...
... The method constructs a DOM tree in which each node corresponds to an HTML tag, and it computes the text and links in the nodes, based on which a score is computed. ECON is another simple method, proposed by Guo et al. [7], in which the DOM tree is constructed and analysed. The main and modest idea adopted by this method is that the number of punctuation marks in the core content block is always higher than in a noise block. ...
... The idea is to analyse the punctuation present in the web content. Though it seems simple, the literature shows that the method provides efficient results [7]. The core content of a web page has more punctuation marks, such as commas and periods, than the noise content. ...
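As a toy illustration of this punctuation heuristic (the function name and the cutoffs below are assumptions, not values taken from the cited work), a block can be flagged as core content when it contains enough sentence punctuation relative to its length:

import re

SENTENCE_PUNCT = re.compile(r"[,.;:!?]")

def is_content_block(text, min_count=3, min_density=0.02):
    """Heuristically decide whether a text block looks like article prose."""
    if not text:
        return False
    hits = len(SENTENCE_PUNCT.findall(text))
    return hits >= min_count and hits / len(text) >= min_density

A navigation bar such as "Home | Sports | Politics | Contact" fails both tests, while an ordinary news paragraph easily passes.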
... Other methods are based on style tree induction, that is, the detection of similarities between DOM trees at site level (Yi et al., 2003; Vieira et al., 2006). Overall, efforts made to automatically generate wrappers have centered on three different approaches (Guo et al., 2010): wrapper induction (e.g. building a grammar to parse a web page), sequence labeling (e.g. ...
... labeled examples or a schema of the data in the page), and statistical analysis. This approach, combined with the inspection of DOM tree characteristics (Wang et al., 2009; Guo et al., 2010), is common ground for the information retrieval and computational linguistics communities, with the categorization of HTML elements and linguistic features (Ziegler and Skubacz, 2007) for the former and boilerplate removal for the latter. ...
... 4 A Web crawler is the best tool for news extraction. Guo et al. 5 provide an effective and easy way, Extract COntent from web News (ECON), to automatically extract content from any news web page written in any language. It exploits the document object model (DOM) tree structure of the news web page and uses features of the DOM tree to do its job. ...
... Almost 70% of the data on every news story web page is irrelevant material. In order to separate and extract the actual content of a story from a news story web page, the technique presented in Guo et al. 5 has been employed. Thus, the crawler application uses breadth-first search to explore all the news while ignoring the noisy data. ...
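A schematic version of such a breadth-first crawler is sketched below; fetch_links() and extract_article() are hypothetical stand-ins for a real link extractor and an ECON-style content extractor:

from collections import deque

def crawl_news(seed_urls, fetch_links, extract_article, max_pages=1000):
    queue = deque(seed_urls)          # FIFO queue gives breadth-first order
    seen = set(seed_urls)
    articles = {}
    while queue and len(articles) < max_pages:
        url = queue.popleft()
        articles[url] = extract_article(url)   # keep only the story text, not the noise
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return articles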
Article
Full-text available
In the past few decades, the whole world has been badly affected by terrorism and other law-and-order situations. Newspapers have been covering terrorism and other law-and-order issues with relevant details. However, to the best of our knowledge, there is no existing information system capable of accumulating and analyzing these events to help devise strategies to avoid and minimize such incidents in the future. This research aims to provide a generic architectural framework to semi-automatically accumulate law-and-order-related news through different news portals and classify it using machine learning approaches. The proposed architectural framework discusses all the important components, which include data ingestion, preprocessor, reporting and visualization, and pattern recognition. The information extractor and news classifier have been implemented, whereby the classification sub-component employs widely used text classifiers on a news data set comprising almost 5000 news items manually compiled for this purpose. The results reveal that both the support vector machine and multinomial Naïve Bayes classifiers exhibit almost 90% accuracy. Finally, a generic method for calculating the security profile of a city or region has been developed, augmented by visualization and reporting components that map this information using a geographical information system.
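A minimal sketch of the kind of classification experiment reported here, using scikit-learn's MultinomialNB and LinearSVC on a bag-of-words representation (the texts and labels arguments are placeholders for the manually compiled news data set; the cross-validation setup is an assumption for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def compare_classifiers(texts, labels):
    # Evaluate both classifiers with 5-fold cross-validation on TF-IDF features.
    for clf in (MultinomialNB(), LinearSVC()):
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
        print(f"{type(clf).__name__}: mean accuracy = {scores.mean():.3f}")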
... Experimental results showed that ECON can achieve high accuracy and fully meet the requirements for scalable extraction. Moreover, ECON can be applied to web pages written in many popular languages such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic, and it can be implemented very easily [6]. ...
Conference Paper
Full-text available
In order to analyze a set of news article data, the essential data, such as the title, date, and body, must first be extracted. At the same time, meaningless data, such as images, headers, footers, advertisements, navigation, and recommended news, must be removed. The problem is that the layout of news articles changes over time and also varies with the news source, and even with the section within it. In this sense, it is essential that a model generalizes to unseen news article layouts. We argue that a machine-learning-based model handles unseen layouts better than a rule-based model in several experiments. Furthermore, because we define the classification unit as a leaf node itself, noise inside the body can be removed, whereas general machine-learning-based models cannot remove such noise, since they treat an internal node containing the leaf as the classification unit and therefore cannot classify a leaf block by itself.
... Under the assumption that the main content of news site pages lies between paragraph tags, this method was developed specifically for news sites. While there are other studies developed specifically for news, blog, and discussion sites [13,14], there is also research targeting product-oriented sites such as e-commerce sites [9]. ...
Conference Paper
Full-text available
ABSTRACT — As the amount of content on websites increases, automatic content extraction from Web pages becomes more important. Although many studies on this subject have been published, a method that fully solves the problem has not emerged, owing to the flexible structure of HTML. The performance of methods that succeed at certain rates also decreases over time with the changing and evolving structure of the Web. In this study, a browser extension was developed to automatically download the text content of Web pages. The extension produces output with a 100% recall rate by cleaning the text content of a Web page of all tags and code with a parser that utilizes the Document Object Model (DOM) structure. This language-independent browser extension has been tested on different types of popular Web sites in Turkey and has been shown to work successfully.
Article
Full-text available
Automatic news repository collection systems involve a news crawler that extracts news from different news portals; subsequently, this news needs to be processed to figure out the category of each news article, e.g. sports, politics, showbiz, etc. There are two main challenges in this process: the first is to place a news article under the right category, while the second is to detect duplicate news, i.e. when news is being extracted from multiple sources, it is highly probable that the same news will be obtained from many different portals, resulting in duplicates; failure to detect them may result in inconsistent statistics after pre-processing the news text. This problem becomes more pertinent when dealing with human-loss news involving crime-related, accident-related, and similar news articles, as the system may count the same news many times, producing misleading statistics. In order to address these problems, this research presents the following contributions. Firstly, a news corpus comprising human-loss news of different categories has been developed by gathering data from well-known and authentic news websites; the corpus also includes a number of duplicate news items. Secondly, a comparison of different classification approaches has been conducted to empirically find the text classifier best suited to categorizing the different sub-categories of human-loss news. Lastly, methods have been proposed and compared to detect duplicate news in the corpus, involving different pre-processing techniques and the widely used similarity measures cosine similarity and Jaccard's coefficient. The results show that conventional text classifiers are still relevant and perform well in text classification tasks, as MNB achieved 89.5% accuracy, while the Jaccard coefficient exhibits much better results than cosine similarity for duplicate news detection across different pre-processing variations, with an average accuracy of 83.16%.
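The two similarity measures compared in this work can be sketched as follows (a hedged illustration; the 0.8 duplicate threshold and whitespace tokenization are assumptions, not values reported in the paper):

import math
from collections import Counter

def tokens(text):
    return text.lower().split()

def jaccard(a, b):
    # Ratio of shared unique tokens to all unique tokens.
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a, b):
    # Cosine of the angle between term-frequency vectors.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_duplicate(a, b, threshold=0.8):
    return jaccard(a, b) >= threshold or cosine(a, b) >= threshold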
Conference Paper
With the rapid development of the Internet, information has grown explosively, and traditional search engines have failed to meet the needs of users. This paper proposes a customer service automatic answering system with a high-quality knowledge base. First of all, based on an unsupervised learning algorithm, the system extracts question-and-answer pairs from documents and stores them in the knowledge base. Then, employing a semantic analysis module and Natural Language Processing (NLP) methods, the system accurately determines the meaning of the customer's question, retrieves the knowledge base, and returns a high-resolution answer to the user. Furthermore, we construct a dialog management module, which makes reasonable guesses on issues that cannot be matched and records the dialogue history so that the question-answering system can give more intelligent responses. Finally, due to the diversity of document structures and the complexity of natural Chinese language, the system adds an editing function that can add, delete, and modify the question-and-answer pairs in the knowledge base. Therefore, our customer service automatic answering system can be more intelligent and efficient than existing question-and-answer systems.
Article
Full-text available
Knowledge in web documents, relevance ranking of webpages, and related topics are some of the under-researched areas in web content mining (WCM). Apart from the general data mining tools used for knowledge discovery on the web, there have been few attempts at reviewing WCM, and these were from the perspective of the methods used and the problems solved, but not in sufficient depth. These existing literature reviews also do not reveal which problems have been under-researched and which application areas have received the most attention in WCM. The goal of this systematic review is to provide a comprehensive and semi-structured overview of WCM methods, problems, and the solutions proffered. To provide a comprehensive literature review on this subject, 57 publications, including journals, conference proceedings, and workshops, were considered for the period 1999-2018. The findings reveal that updating dynamic content, efficient content extraction, eliminating noise blocks, and similar tasks remain the most prominent challenges associated with WCM, with very high attention being paid to solving these problems more efficiently. Also, most of the solutions proffered still come with various limitations, which makes this area fertile for future research. With regard to content, the techniques used for content extraction in WCM include Data Update Propagation (DUP), association rules, Object Dependence Graphs, classification techniques, the Document Object Model, Vision-Based Segmentation, Hyperlink-Induced Topic Search, and so on. Finally, the study revealed that WCM has mostly been applied to general websites, including random webpages from which specific parameters are extracted. The review identified the limitations of the current research on the subject matter and future research opportunities in WCM.
Article
Full-text available
Columbia's Newsblaster tracking and summarization system is a robust system that clusters news into events, categorizes events into broad topics and summarizes multiple articles on each event. Here we outline our most current work on tracking events over days, producing summaries that update a user on new information about an event, outlining the perspectives of news coming from different countries and clustering and summarizing non-English sources.
Conference Paper
Full-text available
Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
Conference Paper
Full-text available
We present an archiving technique for hierarchical data with key structure. Our approach is based on the notion of timestamps whereby an element appearing in multiple versions of the database is stored only once along with a compact description of versions ...
Conference Paper
Full-text available
The research in information extraction (IE) concerns the generation of wrappers that can extract particular information from semi-structured Web documents. Similar to compiler generation, the extractor is actually a driver program accompanied by a generated extraction rule. Previous work in this field aims to learn extraction rules from users' training examples. In this paper, we propose IEPAD, a system that automatically discovers extraction rules from Web pages. The system can automatically identify record boundaries by repeated pattern mining and multiple sequence alignment. The discovery of repeated patterns is realized through a data structure called a PAT tree. Additionally, repeated patterns are further extended by pattern alignment to comprehend all record instances. This new approach to IE involves no human effort and no content-dependent heuristics. Experimental results show that the constructed extraction rules can achieve 97 percent extraction over fourteen popular search engines.
Conference Paper
Full-text available
The Web poses itself as the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because of the need for high human intervention and the low quality of the extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach on several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
Conference Paper
Full-text available
Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is obtaining the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute of wrapper induction for each site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).
Conference Paper
We developed and tested a heuristic technique for extracting the main article from news site Web pages. We construct the DOM tree of the page and score every node based on the amount of text and the number of links it contains. The method is site-independent and does not use any language-based features. We tested our algorithm on a set of 1120 news article pages from 27 domains. This dataset was also used elsewhere to test the performance of another, state-of-the-art baseline system. Our algorithm achieved over 97% precision and 98% recall, and an average processing speed of under 15 ms per page. This precision/recall performance is slightly below the baseline system, but our approach requires significantly less computational work.
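A rough sketch of this text-versus-links scoring idea (not the paper's exact formula; the candidate tags and the link penalty weight are assumptions made for illustration) could look like this:

from lxml import html

def best_text_node(page_source):
    tree = html.fromstring(page_source)

    def score(node):
        # Reward nodes that hold a lot of text and penalize those dominated by anchor text.
        text_len = len(node.text_content())
        link_len = sum(len(a.text_content()) for a in node.findall(".//a"))
        return text_len - 2 * link_len   # assumed penalty weight

    candidates = tree.iter("div", "article", "section", "td")
    return max(candidates, key=score, default=tree)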