Conference Paper · PDF Available

ECON: An Approach to Extract Content from Web News Page

Abstract

This paper presents a simple but effective approach, named ECON, to fully automatically extract content from Web news pages. ECON represents a Web news page as a DOM tree and leverages substantial features of that tree. ECON first finds a snippet-node that wraps part of the news content, then backtracks from the snippet-node until it reaches a summary-node that wraps the entire news content, removing noise during the backtracking process. Experimental results showed that ECON achieves high accuracy and fully satisfies the requirements for scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic, and it can be implemented very easily.
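The snippet-node/summary-node flow described in the abstract can be illustrated with a short, hypothetical sketch (this is not the authors' reference implementation; the punctuation test, the climbing threshold, and the link-density cutoff are assumptions made for illustration):

from lxml import html

# Latin and CJK punctuation marks used as a rough signal of article prose.
PUNCT = set(",.;:!?\u3002\uff0c\uff1b\uff1a\uff01\uff1f")

def punct_count(text):
    return sum(1 for ch in (text or "") if ch in PUNCT)

def extract_article(page_source):
    tree = html.fromstring(page_source)
    # 1. Snippet-node: the element whose own text carries the most punctuation,
    #    taken as a node that wraps part of the news content.
    snippet = max(tree.iter("*"), key=lambda n: punct_count(n.text))
    # 2. Backtrack from the snippet-node toward the root; keep climbing while the
    #    parent adds article-like punctuation, and stop at the summary-node once
    #    ancestors only contribute low-punctuation noise (menus, ads, footers).
    node = snippet
    while node.getparent() is not None:
        parent = node.getparent()
        gain = punct_count(parent.text_content()) - punct_count(node.text_content())
        if gain > 2:  # assumed threshold
            node = parent
        else:
            break
    # 3. Crude noise removal inside the summary-node: drop children dominated by link text.
    for child in list(node):
        text = child.text_content().strip()
        link_text = "".join(a.text_content() for a in child.findall(".//a"))
        if text and len(link_text) > 0.8 * len(text):
            node.remove(child)
    return node.text_content().strip()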
... The process of eliminating the noise present in web pages is referred to as web content outlier mining [5][6]. Also, identifying the valuable information in a document that contains non-informative content is a most significant task, as it helps users access web pages on handheld devices such as smartphones and PDAs [7][8]. ...
... The method constructs a DOM tree in which each node corresponds to an HTML tag, and it computes the text and links in the nodes, based on which a score is computed. ECON is another simple method, proposed by Guo et al. [7], in which the DOM tree is constructed and analysed. The main and modest idea adopted by this method is that the number of punctuation marks in the core content block is always higher than in a noise block. ...
... The idea is to analyse the punctuation present in the web content. Though it seems simple, the literature shows that the method provides efficient results [7]. The core content of a web page has more punctuation marks, such as commas and periods, than the noise content. ...
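As a toy illustration of this punctuation heuristic (the function name and the cutoffs below are assumptions, not values taken from the cited work), a block can be flagged as core content when it contains enough sentence punctuation relative to its length:

import re

SENTENCE_PUNCT = re.compile(r"[,.;:!?]")

def is_content_block(text, min_count=3, min_density=0.02):
    """Heuristically decide whether a text block looks like article prose."""
    if not text:
        return False
    hits = len(SENTENCE_PUNCT.findall(text))
    return hits >= min_count and hits / len(text) >= min_density

A navigation bar such as "Home | Sports | Politics | Contact" fails both tests, while an ordinary news paragraph easily passes.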
... Other methods are based on style tree induction, that is, the detection of similarities between DOM trees at site level (Yi et al., 2003; Vieira et al., 2006). Overall, efforts made to automatically generate wrappers have centered on three different approaches (Guo et al., 2010): wrapper induction (e.g. building a grammar to parse a web page), sequence labeling (e.g. ...
... labeled examples or a schema of the data in the page), and statistical analysis. This approach, combined with the inspection of DOM tree characteristics (Wang et al., 2009; Guo et al., 2010), is common ground for the information retrieval and computational linguistics communities, with the categorization of HTML elements and linguistic features (Ziegler and Skubacz, 2007) for the former and boilerplate removal for the latter. ...
... 4 A Web crawler is the best tool for news extraction. Guo et al. 5 provide an effective and easy way, Extract COntent from web News (ECON), to automatically extract content from any news web page written in any language. It exploits the document object model (DOM) tree structure of the news web page and uses features of the DOM tree to do its job. ...
... Almost 70% of the data on every news story web page is irrelevant material. In order to separate and extract the actual content of a story from a news story web page, the technique presented in Guo et al. 5 has been employed. Thus, the crawler application uses breadth-first search to explore all the news while ignoring the noisy data. ...
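A schematic version of such a breadth-first crawler is sketched below; fetch_links() and extract_article() are hypothetical stand-ins for a real link extractor and an ECON-style content extractor:

from collections import deque

def crawl_news(seed_urls, fetch_links, extract_article, max_pages=1000):
    queue = deque(seed_urls)          # FIFO queue gives breadth-first order
    seen = set(seed_urls)
    articles = {}
    while queue and len(articles) < max_pages:
        url = queue.popleft()
        articles[url] = extract_article(url)   # keep only the story text, not the noise
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return articles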
Article
Full-text available
In the past few decades, the whole world has been badly affected by terrorism and other law-and-order situations. Newspapers have been covering terrorism and other law-and-order issues with relevant details. However, to the best of our knowledge, there is no existing information system capable of accumulating and analyzing these events to help devise strategies to avoid and minimize such incidents in the future. This research aims to provide a generic architectural framework to semi-automatically accumulate law-and-order-related news through different news portals and classify it using machine learning approaches. The proposed architectural framework discusses all the important components, which include data ingestion, preprocessor, reporting and visualization, and pattern recognition. The information extractor and news classifier have been implemented, whereby the classification sub-component employs widely used text classifiers on a news data set comprising almost 5000 news items manually compiled for this purpose. The results reveal that both the support vector machine and multinomial Naïve Bayes classifiers exhibit almost 90% accuracy. Finally, a generic method for calculating the security profile of a city or region has been developed, augmented by visualization and reporting components that map this information using a geographical information system.
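A minimal sketch of the kind of classification experiment reported here, using scikit-learn's MultinomialNB and LinearSVC on a bag-of-words representation (the texts and labels arguments are placeholders for the manually compiled news data set; the cross-validation setup is an assumption for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def compare_classifiers(texts, labels):
    # Evaluate both classifiers with 5-fold cross-validation on TF-IDF features.
    for clf in (MultinomialNB(), LinearSVC()):
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
        print(f"{type(clf).__name__}: mean accuracy = {scores.mean():.3f}")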
... Experimental results showed that ECON can achieve high accuracy and fully meet the requirements for scalable extraction. Moreover, ECON can be applied to web pages written in many popular languages such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic, and it can be implemented very easily [6]. ...
Conference Paper
Full-text available
In order to analyze a set of news article data, the essential data, such as the title, date, and body, must first be extracted. At the same time, meaningless data, such as images, headers, footers, advertisements, navigation, and recommended news, must be removed. The problem is that the layout of news articles changes over time and also varies with the news source, and even with the section within it. In this sense, it is essential that a model generalizes to unseen news article layouts. We argue that a machine-learning-based model handles unseen layouts better than a rule-based model in several experiments. Furthermore, because we define the classification unit as a leaf node itself, noise inside the body can be removed, whereas general machine-learning-based models cannot remove such noise, since they treat an internal node containing the leaf as the classification unit and therefore cannot classify a leaf block by itself.
... Under the assumption that the main content of news site pages lies between paragraph tags, this method was developed specifically for news sites. While there are other studies developed specifically for news, blog, and discussion sites [13,14], there is also research targeting product-oriented sites such as e-commerce sites [9]. ...
Conference Paper
Full-text available
ABSTRACT — As the amount of content on websites increases, automatic content extraction from Web pages becomes more important. Although many studies on this subject have been published, a method that fully solves the problem has not emerged, owing to the flexible structure of HTML. The performance of methods that succeed at certain rates also decreases over time with the changing and evolving structure of the Web. In this study, a browser extension was developed to automatically download the text content of Web pages. The extension produces output with a 100% recall rate by cleaning the text content of a Web page of all tags and code with a parser that utilizes the Document Object Model (DOM) structure. This language-independent browser extension has been tested on different types of popular Web sites in Turkey and has been shown to work successfully.
Article
Full-text available
Automatic news repository collection systems involve a news crawler that extracts news from different news portals; subsequently, this news needs to be processed to figure out the category of each news article, e.g. sports, politics, showbiz, etc. There are two main challenges in this process: the first is to place a news article under the right category, while the second is to detect duplicate news, i.e. when news is being extracted from multiple sources, it is highly probable that the same news will be obtained from many different portals, resulting in duplicates; failure to detect them may result in inconsistent statistics after pre-processing the news text. This problem becomes more pertinent when dealing with human-loss news involving crime-related, accident-related, and similar news articles, as the system may count the same news many times, producing misleading statistics. In order to address these problems, this research presents the following contributions. Firstly, a news corpus comprising human-loss news of different categories has been developed by gathering data from well-known and authentic news websites; the corpus also includes a number of duplicate news items. Secondly, a comparison of different classification approaches has been conducted to empirically find the text classifier best suited to categorizing the different sub-categories of human-loss news. Lastly, methods have been proposed and compared to detect duplicate news in the corpus, involving different pre-processing techniques and the widely used similarity measures cosine similarity and Jaccard's coefficient. The results show that conventional text classifiers are still relevant and perform well in text classification tasks, as MNB achieved 89.5% accuracy, while the Jaccard coefficient exhibits much better results than cosine similarity for duplicate news detection across different pre-processing variations, with an average accuracy of 83.16%.
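The two similarity measures compared in this work can be sketched as follows (a hedged illustration; the 0.8 duplicate threshold and whitespace tokenization are assumptions, not values reported in the paper):

import math
from collections import Counter

def tokens(text):
    return text.lower().split()

def jaccard(a, b):
    # Ratio of shared unique tokens to all unique tokens.
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a, b):
    # Cosine of the angle between term-frequency vectors.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_duplicate(a, b, threshold=0.8):
    return jaccard(a, b) >= threshold or cosine(a, b) >= threshold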
Conference Paper
With the rapid development of the Internet, information has grown explosively, and traditional search engines have failed to meet the needs of users. This paper proposes a customer service automatic answering system with a high-quality knowledge base. First of all, based on an unsupervised learning algorithm, the system extracts question-and-answer pairs from documents and stores them in the knowledge base. Then, employing a semantic analysis module and Natural Language Processing (NLP) methods, the system accurately determines the meaning of the customer's question, retrieves the knowledge base, and returns a high-resolution answer to the user. Furthermore, we construct a dialog management module, which makes reasonable guesses on issues that cannot be matched and records the dialogue history so that the question-answering system can give more intelligent responses. Finally, due to the diversity of document structures and the complexity of natural Chinese language, the system adds an editing function that can add, delete, and modify the question-and-answer pairs in the knowledge base. Therefore, our customer service automatic answering system can be more intelligent and efficient than existing question-and-answer systems.
Article
Full-text available
Knowledge in web documents, relevance ranking of webpages, and related topics are some of the under-researched areas in web content mining (WCM). Apart from the general data mining tools used for knowledge discovery on the web, there have been few attempts at reviewing WCM, and these were from the perspective of the methods used and the problems solved, but not in sufficient depth. These existing literature reviews also do not reveal which problems have been under-researched and which application areas have received the most attention in WCM. The goal of this systematic review is to provide a comprehensive and semi-structured overview of WCM methods, problems, and the solutions proffered. To provide a comprehensive literature review on this subject, 57 publications, including journals, conference proceedings, and workshops, were considered for the period 1999-2018. The findings reveal that updating dynamic content, efficient content extraction, eliminating noise blocks, and similar tasks remain the most prominent challenges associated with WCM, with very high attention being paid to solving these problems more efficiently. Also, most of the solutions proffered still come with various limitations, which makes this area fertile for future research. With regard to content, the techniques used for content extraction in WCM include Data Update Propagation (DUP), association rules, Object Dependence Graphs, classification techniques, the Document Object Model, Vision-Based Segmentation, Hyperlink-Induced Topic Search, and so on. Finally, the study revealed that WCM has mostly been applied to general websites, including random webpages from which specific parameters are extracted. The review identified the limitations of the current research on the subject matter and future research opportunities in WCM.
Article
Full-text available
Columbia's Newsblaster tracking and summarization system is a robust system that clusters news into events, categorizes events into broad topics and summarizes multiple articles on each event. Here we outline our most current work on tracking events over days, producing summaries that update a user on new information about an event, outlining the perspectives of news coming from different countries and clustering and summarizing non-English sources.
Conference Paper
Full-text available
Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
Conference Paper
Full-text available
We present an archiving technique for hierarchical data with key structure. Our approach is based on the notion of timestamps whereby an element appearing in multiple versions of the database is stored only once along with a compact description of versions ...
Conference Paper
Full-text available
The research in information extraction (IE) concerns the generation of wrappers that can extract particular information from semi-structured Web documents. Similar to compiler generation, the extractor is actually a driver program accompanied by a generated extraction rule. Previous work in this field aims to learn extraction rules from users' training examples. In this paper, we propose IEPAD, a system that automatically discovers extraction rules from Web pages. The system can automatically identify record boundaries by repeated pattern mining and multiple sequence alignment. The discovery of repeated patterns is realized through a data structure called a PAT tree. Additionally, repeated patterns are further extended by pattern alignment to comprehend all record instances. This new approach to IE involves no human effort and no content-dependent heuristics. Experimental results show that the constructed extraction rules can achieve 97 percent extraction over fourteen popular search engines.
Conference Paper
Full-text available
The Web poses itself as the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because of the need for high human intervention and the low quality of the extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach on several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
Conference Paper
Full-text available
Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is obtaining the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute of wrapper induction for each site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).
Conference Paper
We developed and tested a heuristic technique for extracting the main article from news site Web pages. We construct the DOM tree of the page and score every node based on the amount of text and the number of links it contains. The method is site-independent and does not use any language-based features. We tested our algorithm on a set of 1120 news article pages from 27 domains. This dataset was also used elsewhere to test the performance of another, state-of-the-art baseline system. Our algorithm achieved over 97% precision and 98% recall, and an average processing speed of under 15 ms per page. This precision/recall performance is slightly below the baseline system, but our approach requires significantly less computational work.
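A rough sketch of this text-versus-links scoring idea (not the paper's exact formula; the candidate tags and the link penalty weight are assumptions made for illustration) could look like this:

from lxml import html

def best_text_node(page_source):
    tree = html.fromstring(page_source)

    def score(node):
        # Reward nodes that hold a lot of text and penalize those dominated by anchor text.
        text_len = len(node.text_content())
        link_len = sum(len(a.text_content()) for a in node.findall(".//a"))
        return text_len - 2 * link_len   # assumed penalty weight

    candidates = tree.iter("div", "article", "section", "td")
    return max(candidates, key=score, default=tree)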