Example of vertical alignment.

Source publication

Towards Deeper Understanding of the Search Interfaces of the Deep Web

Article

Full-text available

Jun 2007

Many databases have become Web-accessible through form-based search interfaces (i.e., HTML forms) that allow users to specify complex and precise queries to access the underlying databases. In general, such a Web search interface can be considered as containing an interface schema with multiple attributes and rich semantic/meta-information; however...

Context 1

... type: Domain type indicates how many distinct values can be used for an attribute for queries. Four domain types are defined in our model: range , finite (with a finite number of possible values but no range semantics), infinite (with possibly unlimited number of values, e.g., textbox, but no range semantics) and Boolean . For the Boolean type, the attribute just consists of a single checkbox , and this checkbox is usually used to mark a yes-or-no selection. In our model, Boolean type is separated from the regular finite type as this separation makes the Boolean property of the attribute explicit. We only focus on identifying the other three types. In general, an attribute that has a textbox can be assumed to have infinite domain type, and that consists of selection lists can be assumed to have finite domain type. However, this assumption is not always the case. For example, in Figure 3, two textboxes are used to represent a range for the attribute Publication year , thus the attribute should have range domain type; in Figure 5, the attribute price has a selection list that contains several price ranges for users to select. Therefore, identifying domain type of an attribute cannot only depend on the format type of its elements. Instead, the correct domain type should be predicted from the attribute labels, element labels and names, values together with format types. We design a Domain Type Classifier to combine these features and predict correct domain type based on the training examples. Value type: Each attribute on a search interface has its own semantic value type even though all input values are treated as text values to be sent to Web databases through HTTP. For example, the attribute Reader age semantically has integer values, and departure date has date values. Value types currently defined in our model include date , time , datetime , currency , id , number and char , but more types could be added. We can obtain the useful information for identifying value type of attributes from their labels, element names and values. An important thing about value type is that the identical or similar attributes from different search interfaces of the same domain should have the same value type. We design a Value Type Classifier to classify each attribute into an appropriate value type, and construct a feature vector { attLabel , elemLabels, elemNames , elemValues } for each attribute using the available information. Default value: Default values in many cases indicate some semantics of the attributes. For example, in Figure 1.a the attribute Reader age has a default value “ all ages ”. A default value may occur in a selection list, a group of radio buttons and a group of checkboxes. It is always marked as “ checked ” or “ selected ” in the HTML text of search forms. Therefore, it is easy to identify default values. Unit: A unit defines the meaning of an attribute value (e.g., kilogram is a unit for weight ). Different sites may use different units for values of the same attributes. For example, a search interface from USA may use “ USD ” as the unit of its Price attribute, while another from Canada may use “ CAD ” for its Price attribute. Identifying the correct units associated with attribute values can help understand attributes, but not all attributes have units (e.g., attribute author and title do not have applicable units). Unit information can be contained in site URL (e.g., amazon.ca and amazon.co.uk respectively have suffixes ca and uk ) and attributes themselves. We represent an attribute as a feature vector { URLsuffix , attLabel , elemLabels, elemNames , elemValues }, and design a Unit Classifier to identify units. As defined in Section 2, a query submitted through a search interface can be formed in four possible ways (i.e., conjunctive , disjunctive , exclusive and hybrid ). As the exclusive relationship of attributes is already addressed in Section 3.2, we focus on the other three types of relationships in this section. Clearly, if two attributes are in a conjunctive relationship, the number of results returned from a Web database for a query using both attributes to specify conditions cannot be greater than that using only one of these two attributes in another query; and if two attributes are in a disjunctive relationship, then the number of results using both attributes cannot be less than that using only one of them. Thus, in principle, logic relationships between attributes could be identified by submitting appropriate queries to a Web database and by comparing the numbers of hits for different queries. In reality, however, it is rather difficult to automatically find appropriate queries to submit and to extract the numbers of results correctly. Therefore, in our current implementation, we take a simple and practical approach to tackle the problem. We observe that some Web databases contain logic operators (i.e., and , or ) on their interfaces. In this case, it is easy to identify the logic relationships among the involved attributes. Attributes that are not involved in explicit logic operators or exclusive relationships are assumed to have conjunctive relationships among themselves and with other attributes on the interface. Most interfaces have no explicit logic operators or exclusive attributes (e.g., Figure 1.a), so conjunctive relationships are assumed for attributes. If different types of logic relationships exist among the attributes on an interface, then a hybrid relationship is recognized for the interface. This simple approach, though heuristic, is really effective for identifying the logic relationships of attributes as shown in our experiments (by our experiments, 180 out of 184 forms used in our dataset are correctly identified). We have implemented a new version of WISE- i Extractor in Java. We apply an HTML parser package (it is an open source package and can be downloaded from to parse an HTML page to get all the labels and elements of each search interface on the page. Then we use the ...

View in full-text

TABLE 1 : CLASSIFICATION OF SYSTEMS BASED ON THEIR VULNERABILITIES TO...

Figure 10: Experimental Results Showing The Efficiency Of PALM

PALM: Preprocessed Apriori For Logical Matching Using Map Reduce Algorithm-PALM (Preprocessed Apriori For Logical Matching) Algorithm, Apriori Algorithm, Map Reducing Algorithm and Pattern Matching

Article

Full-text available

Jul 2012

With the recent explosive growth of the amount of data content and information on the Internet, it has become increasingly difficult for users to find and maximize the utilization of the information found in the internet. Traditional web search engines often return hundreds or thousands of results for a particular search, which is time consuming .I...

Machines with WebSense

Conference Paper

Full-text available

May 2013

Loizos Michael

We discuss the design and development of a novel web search engine able to respond to user queries with inferences that follow from the collective human knowledge found across the Web. The engine's knowledge --- or websense --- is represented and reasoned with in a logical fashion, and is acquired autonomously via principled learning, so that the s...

Semantic Matching: Climbing the Tree for Correct and Relevant Query Results

Article

Full-text available

One problem with existing web search engines is their restriction of correlation of key words to textual matching. This both excludes results which closely match the intent of the query and includes results which do not match the intent of the query at all. A great deal of research has gone into finding a scalable way to tag every entity in a query...

Adaptive Ontology-Based Web Information Retrieval

Article

Full-text available

Sep 2013

Finding relevant information on the Web is difficult for most users. Although Web search applications are improving, they must be more ?intelligent? to adapt to the search domains targeted by queries, the evolution of these domains, and users? characteristics. In this paper, the authors present the TARGET framework for Web Information Retrieval. Th...

Deep Web crawling: a survey

Article

Full-text available

Jul 2019
WORLD WIDE WEB

Deep Web crawling refers to the problem of traversing the collection of pages in a deep Web site, which are dynamically generated in response to a particular query that is submitted using a search form. To achieve this, crawlers need to be endowed with some features that go beyond merely following links, such as the ability to automatically discover search forms that are entry points to the deep Web, fill in such forms, and follow certain paths to reach the deep Web pages with relevant information. Current surveys that analyse the state of the art in deep Web crawling do not provide a framework that allows comparing the most up-to-date proposals regarding all the different aspects involved in the deep Web crawling process. In this article, we propose a framework that analyses the main features of existing deep Web crawling-related techniques, including the most recent proposals, and provides an overall picture regarding deep Web crawling, including novel features that to the present day had not been analysed by previous surveys. Our main conclusion is that crawler evaluation is an immature research area due to the lack of a standard set of performance measures, or a benchmark or publicly available dataset to evaluate the crawlers. In addition, we conclude that the future work in this area should be focused on devising crawlers to deal with ever-evolving Web technologies and improving the crawling efficiency and scalability, in order to create effective crawlers that can operate in real-world contexts.

Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules

Article

Full-text available

Feb 2019
INFORM SYST FRONT

Chichang Jou

Along with the popularity of the world wide web, data volumes inside web databases have been increasing tremendously. These deep web contents, hidden behind the query interfaces, are of much better quality than those in the surface web. Internet users need to fill in query conditions in the HTML query interface and click the submit button to obtain deep web data. Many deep web contents related applications, like named entity attribute collection, topic-focused crawling, and heterogeneous data integration, are based on understanding schema of these query interfaces. The schema needs to cover mappings of input elements and labels, data types of valid input values, and range constraints of the input values. Additionally, to extract these hidden data, the schema needs to include many form submission related information, like cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user’s view and the designer’s view, with the aid of semantic information, we build heuristic rules to extract schema from IEXP of query interfaces in the ICQ dataset. These rules are constructed through utilizing (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form submission related information, the extracted schemas are then stored in the XML format, so that they could be utilized in further applications, like schema matching and merging for federated query interface integration. The experimental results on the TEL-8 dataset illustrate that HSE produces effective performance.

Dark Web and Trading of Illegal Drugs

Article

Full-text available

Dec 2018

Post-processing of Deep Web Information Extraction Based on Domain Ontology

Article

Full-text available

Nov 2013
ADV ELECTR COMPUT EN

Many methods are utilized to extract and process query results in deep Web, which rely on the different structures of Web pages and various designing modes of databases. However, some semantic meanings and relations are ignored. So, in this paper, we present an approach for post-processing deep Web query results based on domain ontology which can utilize the semantic meanings and relations. A block identification model (BIM) based on node similarity is defined to extract data blocks that are relevant to specific domain after reducing noisy nodes. Feature vector of domain books is obtained by result set extraction model (RSEM) based on vector space model (VSM). RSEM, in combination with BIM, builds the domain ontology on books which can not only remove the limit of Web page structures when extracting data information, but also make use of semantic meanings of domain ontology. After extracting basic information of Web pages, a ranking algorithm is adopted to offer an ordered list of data records to users. Experimental results show that BIM and RSEM extract data blocks and build domain ontology accurately. In addition, relevant data records and basic information are extracted and ranked. The performances precision and recall show that our proposed method is feasible and efficient.

Understanding Query Interfaces by Statistical Parsing

Article

Full-text available

May 2013
ACM T WEB

Users submit queries to an online database via its query interface. Query interface parsing, which is important for many applications, understands the query capabilities of a query interface. Since most query interfaces are organized hierarchically, we present a novel query interface parsing method, StatParser (Statistical Parser), to automatically extract the hierarchical query capabilities of query interfaces. StatParser automatically learns from a set of parsed query interfaces and parses new query interfaces. StatParser starts from a small grammar and enhances the grammar with a set of probabilities learned from parsed query interfaces under the maximum-entropy principle. Given a new query interface, the probability-enhanced grammar identifies the parse tree with the largest global probability to be the query capabilities of the query interface. Experimental results show that StatParser very accurately extracts the query capabilities and can effectively overcome the problems of existing query interface parsers.

The Ontological Key: Automatically Understanding and Integrating Forms to Access the Deep Web

Article

Full-text available

Oct 2012
VLDB J

Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding provides applications, ranging from crawlers over meta-search engines to service integrators, with a key to this content. Yet, it has received little attention other than as component in specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data. In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines features from the text, structure, and visual rendering of a web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches for form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect (more than 97 percent accuracy in the evaluation domains) form interpretations. Yet, the effort to produce a domain schema is very low, as we provide a Datalog-based template language that eases the specification of such schemata and a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of the form interpretations in OPAL through a light-weight form integration system that successfully translates and distributes master queries to hundreds of forms with no error, yet is implemented with only a handful translation rules.

An Initial Log Analysis of Usage Patterns on a Research Networking System

Article

Aug 2012
CTS-CLIN TRANSL SCI

Usage data for research networking systems (RNSs) are valuable but generally unavailable for understanding scientific professionals' information needs and online collaborator seeking behaviors. This study contributes a method for evaluating RNSs and initial usage knowledge of one RNS obtained from using this method. We designed a log for an institutional RNS, defined categories of users and tasks, and analyzed correlations between usage patterns and user and query types. Our results show that scientific professionals spend more time performing deep Web searching on RNSs than generic Google users and we also show that retrieving scientist profiles is faster on an RNS than on Google (3.5 seconds vs. 34.2 seconds) whereas organization-specific browsing on a RNS takes longer than on Google (117.0 seconds vs. 34.2 seconds). Usage patterns vary by user role, e.g., faculty performed more informational queries than administrators, which implies role-specific user support is needed for RNSs.

A Conceptual Framework for Efficient Web Crawling in Virtual Integration Contexts

Conference Paper

Full-text available

Sep 2011

Virtual Integration systems require a crawling tool able to navigate and reach relevant pages in the Web in an efficient way. Existing proposals in the crawling area are aware of the efficiency problem, but still most of them need to download pages in order to classify them as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. Such a crawler is able to choose in each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, optimising bandwidth and making it efficient and suitable for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages, without previous intervention from the user. KeywordsCrawlers–Web Navigation–Virtual Integration

On using high-level structured queries for integrating deep-web information sources

Conference Paper

Full-text available

Aug 2011

The actual value of the Deep Web comes from integrating the data its applications provide. Such applications offer human-oriented search forms as their entry points, and there exists a number of tools that are used to fill them in and retrieve the resulting pages programmatically. Solution that rely on these tools are usually costly, which motivated a number of researchers to work on virtual integration, also known as metasearch. Virtual integration abstracts away from actual search forms by providing a unified search form, i.e., a programmer fills it in and the virtual integration system translates it into the application search forms. We argue that virtual integration costs might be reduced further if another abstraction level is provided by issuing structured queries in high-level languages such as SQL, XQuery or SPARQL; this helps abstract away from search forms. As far as we know, there is not a proposal in the literature that addresses this problem. In this paper, we propose a reference framework called IntegraWeb to solve the problems of using high-level structured queries to perform deep-web data integration. Furthermore, we provide a comprehensive report on existing proposals from the database integration and the Deep Web research fields, which can be used in combination to address our problem within the previous reference framework. Index Terms—Internet and emerging technologies; Semantic Web.

FAETON: Form Analysis and Extraction Tool for ONtology construction

Article

Full-text available

Oct 2010
Int J Comput Appl Tech

This paper presents a method for semi-automatically building tailored application ontologies from a set of data acquisition forms. Such ontologies are intended to facilitate the integration of very heterogeneous data generation processes and their linkage to well-known external resources. The resulting tool is being applied to the medical domain, where a wide variety of knowledge and linguistic resources are available. The proposed method consists of first inferring the implicit structure of the forms and then semantically annotating all their textual elements. Finally, by applying a set of patterns over the form inferred structure, the tool generates the ontology axioms that describe it. Our initial results demonstrate that the approach can perform effectively.

Example of vertical alignment.

Context in source publication

Similar publications

Citations