Figure 5. Example of vertical alignment.

Source publication
Article
Full-text available
Many databases have become Web-accessible through form-based search interfaces (i.e., HTML forms) that allow users to specify complex and precise queries against the underlying databases. In general, such a Web search interface can be considered to contain an interface schema with multiple attributes and rich semantic/meta-information; however...

Context in source publication

Context 1
... type: Domain type indicates how many distinct values can be used for an attribute in queries. Four domain types are defined in our model: range, finite (with a finite number of possible values but no range semantics), infinite (with a possibly unlimited number of values, e.g., a textbox, but no range semantics) and Boolean. For the Boolean type, the attribute consists of a single checkbox, usually used to mark a yes-or-no selection. In our model, the Boolean type is separated from the regular finite type because this separation makes the Boolean property of the attribute explicit. We only focus on identifying the other three types. In general, an attribute that has a textbox can be assumed to have the infinite domain type, and one that consists of selection lists can be assumed to have the finite domain type. However, this assumption does not always hold. For example, in Figure 3, two textboxes are used to represent a range for the attribute Publication year, so the attribute should have the range domain type; in Figure 5, the attribute price has a selection list that contains several price ranges for users to select. Therefore, identifying the domain type of an attribute cannot depend only on the format type of its elements. Instead, the correct domain type should be predicted from the attribute labels, element labels and names, and values, together with format types. We design a Domain Type Classifier that combines these features and predicts the correct domain type based on training examples.

Value type: Each attribute on a search interface has its own semantic value type even though all input values are treated as text values to be sent to Web databases through HTTP. For example, the attribute Reader age semantically has integer values, and departure date has date values. Value types currently defined in our model include date, time, datetime, currency, id, number and char, but more types could be added. Useful information for identifying the value type of an attribute can be obtained from its labels, element names and values. An important property of value types is that identical or similar attributes from different search interfaces of the same domain should have the same value type. We design a Value Type Classifier to classify each attribute into an appropriate value type, and construct a feature vector {attLabel, elemLabels, elemNames, elemValues} for each attribute using the available information.

Default value: Default values in many cases indicate some semantics of the attributes. For example, in Figure 1.a the attribute Reader age has the default value "all ages". A default value may occur in a selection list, a group of radio buttons or a group of checkboxes. It is always marked as "checked" or "selected" in the HTML text of search forms, so default values are easy to identify.

Unit: A unit defines the meaning of an attribute value (e.g., kilogram is a unit for weight). Different sites may use different units for values of the same attributes. For example, a search interface from the USA may use "USD" as the unit of its Price attribute, while one from Canada may use "CAD". Identifying the correct units associated with attribute values can help in understanding attributes, but not all attributes have units (e.g., the attributes author and title have no applicable units). Unit information may be contained in the site URL (e.g., amazon.ca and amazon.co.uk have the suffixes ca and uk, respectively) and in the attributes themselves. We represent an attribute as a feature vector {URLsuffix, attLabel, elemLabels, elemNames, elemValues}, and design a Unit Classifier to identify units.
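The excerpt describes the Domain Type, Value Type and Unit classifiers only at the feature level. The following is a minimal sketch of the shared pattern — flatten an attribute into a feature vector and train a learner on annotated examples. The feature names, the range-hint heuristic and the toy training data are invented for illustration; scikit-learn stands in for whatever learner the authors actually used.

```python
# Minimal sketch (not the paper's code) of a Domain Type Classifier that
# combines attribute labels, element labels/names/values and format types.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

RANGE_HINTS = {"from", "to", "between", "min", "max", "under", "over"}

def features(att):
    """Flatten one interface attribute into a feature dict."""
    text = " ".join([att["attLabel"], *att["elemLabels"],
                     *att["elemNames"], *att["elemValues"]]).lower()
    return {
        "format": att["format"],                 # "textbox", "selection", ...
        "n_elements": len(att["elemNames"]),
        "has_range_hint": any(h in text.split() for h in RANGE_HINTS),
    }

# Toy examples: a two-textbox year range vs. a single free-text title box.
train = [
    ({"attLabel": "Publication year", "elemLabels": ["from", "to"],
      "elemNames": ["year_lo", "year_hi"], "elemValues": [],
      "format": "textbox"}, "range"),
    ({"attLabel": "Title", "elemLabels": [], "elemNames": ["title"],
      "elemValues": [], "format": "textbox"}, "infinite"),
]
vec = DictVectorizer()
X = vec.fit_transform([features(a) for a, _ in train])
clf = DecisionTreeClassifier().fit(X, [y for _, y in train])
```

The same vectorize-then-classify shape would apply to the Value Type and Unit classifiers, with URLsuffix added as a feature for units.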
As defined in Section 2, a query submitted through a search interface can be formed in four possible ways (i.e., conjunctive, disjunctive, exclusive and hybrid). As the exclusive relationship of attributes is already addressed in Section 3.2, we focus on the other three types of relationships in this section. Clearly, if two attributes are in a conjunctive relationship, the number of results returned from a Web database for a query that uses both attributes to specify conditions cannot be greater than the number returned for a query that uses only one of the two attributes; and if two attributes are in a disjunctive relationship, the number of results using both attributes cannot be less than that using only one of them. Thus, in principle, logic relationships between attributes could be identified by submitting appropriate queries to a Web database and comparing the numbers of hits for different queries. In reality, however, it is rather difficult to automatically find appropriate queries to submit and to extract the numbers of results correctly. Therefore, in our current implementation, we take a simple and practical approach to the problem. We observe that some Web databases contain logic operators (i.e., and, or) on their interfaces; in this case, it is easy to identify the logic relationships among the involved attributes. Attributes that are not involved in explicit logic operators or exclusive relationships are assumed to have conjunctive relationships among themselves and with the other attributes on the interface. Most interfaces have no explicit logic operators or exclusive attributes (e.g., Figure 1.a), so conjunctive relationships are assumed for their attributes. If different types of logic relationships exist among the attributes on an interface, then a hybrid relationship is recognized for the interface. This simple approach, though heuristic, is effective for identifying the logic relationships of attributes: in our experiments, the relationships on 180 of the 184 forms in our dataset were correctly identified.

We have implemented a new version of WISE-iExtractor in Java. We apply an open-source HTML parser package to parse an HTML page and obtain all the labels and elements of each search interface on the page. Then we use the ...
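The in-principle hit-count test described in the excerpt is easy to state in code. The sketch below assumes a hypothetical count_hits(conditions) helper that submits a query and scrapes the reported number of results — exactly the step the authors note is hard to automate. Note the inequalities are necessary conditions, so a single probe can only rule a relationship out, not prove it.

```python
# Hedged sketch of the hit-count test for the logic relationship between
# two attributes. count_hits() is a hypothetical helper that submits a
# query with the given conditions and extracts the number of results.
def infer_relationship(count_hits, cond_a, cond_b):
    n_a = count_hits([cond_a])            # hits with attribute A alone
    n_b = count_hits([cond_b])            # hits with attribute B alone
    n_ab = count_hits([cond_a, cond_b])   # hits with both attributes

    if n_ab > max(n_a, n_b):
        return "disjunctive"   # rules out conjunctive (would need n_ab <= min)
    if n_ab < min(n_a, n_b):
        return "conjunctive"   # rules out disjunctive (would need n_ab >= max)
    return "undetermined"      # counts are consistent with either reading
```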

Similar publications

Article
Full-text available
With the recent explosive growth of the amount of data content and information on the Internet, it has become increasingly difficult for users to find and maximize the utilization of the information found on the Internet. Traditional web search engines often return hundreds or thousands of results for a particular search, which is time-consuming. I...
Conference Paper
Full-text available
We discuss the design and development of a novel web search engine able to respond to user queries with inferences that follow from the collective human knowledge found across the Web. The engine's knowledge, or websense, is represented and reasoned with in a logical fashion, and is acquired autonomously via principled learning, so that the s...
Article
Full-text available
One problem with existing web search engines is that they restrict the correlation of keywords to textual matching. This both excludes results that closely match the intent of the query and includes results that do not match the intent of the query at all. A great deal of research has gone into finding a scalable way to tag every entity in a query...
Article
Full-text available
Finding relevant information on the Web is difficult for most users. Although Web search applications are improving, they must be more "intelligent" to adapt to the search domains targeted by queries, the evolution of these domains, and users' characteristics. In this paper, the authors present the TARGET framework for Web Information Retrieval. Th...

Citations

... Supervised label extraction techniques train on a number of annotated forms to infer correct associations between form fields and labels [51,67,86]. Unsupervised label extraction can be performed by analysing the HTML code to identify each label in the text [29,32,53], by inspecting the DOM tree [29,75], or by using visual techniques that analyse the proximity and relative position of fields and labels when the page is rendered in a browser [1,24,29,45,98,103]; a sketch of such a positional heuristic follows below. Note that the usual Western convention is to place a label above its associated field or to its left. ...
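As a rough illustration of the visual techniques this snippet cites, the sketch below scores candidate labels by the above-or-left convention using rendered bounding boxes. The box format, alignment tolerances and distance scoring are invented for illustration, not taken from any of the cited systems.

```python
# Hedged sketch of position-based label/field association: prefer the
# candidate label just above the field or to its left. Boxes are
# (x, y, width, height) with y growing downwards.
def label_score(label_box, field_box):
    lx, ly, lw, lh = label_box
    fx, fy, fw, fh = field_box
    if ly + lh <= fy and abs(lx - fx) < fw:      # above, roughly left-aligned
        return fy - (ly + lh)                    # smaller vertical gap wins
    if lx + lw <= fx and abs(ly - fy) < fh:      # left, roughly same row
        return fx - (lx + lw)                    # smaller horizontal gap wins
    return float("inf")                          # not in a conventional spot

def best_label(field_box, candidate_boxes):
    return min(candidate_boxes, key=lambda b: label_score(b, field_box))
```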
Article
Full-text available
Deep Web crawling refers to the problem of traversing the collection of pages in a deep Web site, which are dynamically generated in response to a particular query that is submitted using a search form. To achieve this, crawlers need to be endowed with features that go beyond merely following links, such as the ability to automatically discover search forms that are entry points to the deep Web, fill in such forms, and follow certain paths to reach the deep Web pages with relevant information. Current surveys that analyse the state of the art in deep Web crawling do not provide a framework for comparing the most up-to-date proposals regarding all the different aspects involved in the deep Web crawling process. In this article, we propose a framework that analyses the main features of existing deep Web crawling-related techniques, including the most recent proposals, and provides an overall picture of deep Web crawling, including novel features that had not been analysed by previous surveys. Our main conclusion is that crawler evaluation is an immature research area due to the lack of a standard set of performance measures and of a benchmark or publicly available dataset for evaluating crawlers. In addition, we conclude that future work in this area should focus on devising crawlers that can deal with ever-evolving Web technologies and on improving crawling efficiency and scalability, in order to create effective crawlers that can operate in real-world contexts.
... They used element clustering based on their identified labels to discover relationships among elements. He et al. (2007) proposed to treat the mapped labels and elements as logical attributes. Their schema of a query interface was the set of these logical attributes. ...
... Based on semantic information, they built heuristic rules to combine interface attributes. Naz (2006) used Extensible Markup Language (XML) to represent a deep web query interface schema, similar to that proposed by He et al. (2007). In addition, they added the domains of the elements in the schema. ...
Article
Full-text available
With the popularity of the World Wide Web, data volumes inside web databases have been increasing tremendously. These deep web contents, hidden behind query interfaces, are of much better quality than those on the surface web. Internet users need to fill in query conditions in an HTML query interface and click the submit button to obtain deep web data. Many applications based on deep web contents, like named-entity attribute collection, topic-focused crawling, and heterogeneous data integration, depend on understanding the schema of these query interfaces. The schema needs to cover mappings of input elements and labels, data types of valid input values, and range constraints on the input values. Additionally, to extract the hidden data, the schema needs to include form-submission-related information, like cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, texts surrounding elements are collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user's view and the designer's view, with the aid of semantic information, we build heuristic rules to extract schemas from the IEXP of query interfaces in the ICQ dataset. These rules are constructed by utilizing (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form-submission-related information, the extracted schemas are then stored in XML format, so that they can be utilized in further applications, like schema matching and merging for federated query interface integration. The experimental results on the TEL-8 dataset illustrate that HSE achieves effective performance.
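As a hedged sketch of the label-cleansing step this abstract mentions (HSE's own similarity function and threshold policy are not reproduced here), one could compare each candidate label with the element's HTML name and keep only sufficiently similar candidates, tightening the threshold as the number of competing candidates grows. The concrete similarity measure and threshold schedule below are invented.

```python
# Hedged sketch of cleansing candidate labels with string similarity and a
# dynamic threshold, in the spirit of HSE.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cleanse(candidates, element_name, base=0.5):
    # Demand more similarity when many candidate labels compete.
    threshold = min(0.9, base + 0.05 * max(0, len(candidates) - 2))
    return [c for c in candidates if similarity(c, element_name) >= threshold]

# e.g. cleanse(["Book title:", "Search", "new"], "book_title")
```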
... Tor is designed in such a way that its code can be viewed and integrated with the user's software. Users prefer Tor for the privacy it provides, since it is difficult to trace back online activities such as search history, emails, messages, and social activities [11]. The ability to communicate confidentially using Tor is exploited by many criminals to commit crimes. ...
... Gregg et al. [4] presented an adaptive information extraction system prototype that combined multiple information extraction approaches to allow more accurate and resilient data extraction for a wide variety of Web resources. He et al. [5][6][7] presented a research project on database integration called DMSE-Web, and developed an extraction tool called WISE-IExtractor to obtain query interface schemas. Peng et al. [8] presented an attempt to process semantic queries against a spatial database and demonstrated spatial semantic queries via a practical prototype system. ...
Article
Full-text available
Many methods are utilized to extract and process query results in the deep Web; they rely on the different structures of Web pages and the various design modes of databases. However, some semantic meanings and relations are ignored. In this paper, we present an approach for post-processing deep Web query results based on a domain ontology, which can utilize these semantic meanings and relations. A block identification model (BIM) based on node similarity is defined to extract data blocks that are relevant to a specific domain after reducing noisy nodes. A feature vector of domain books is obtained by a result set extraction model (RSEM) based on the vector space model (VSM). RSEM, in combination with BIM, builds a domain ontology on books that not only removes the dependence on Web page structures when extracting data, but also makes use of the semantic meanings of the domain ontology. After extracting the basic information of Web pages, a ranking algorithm is adopted to offer an ordered list of data records to users. Experimental results show that BIM and RSEM extract data blocks and build the domain ontology accurately. In addition, relevant data records and basic information are extracted and ranked. The precision and recall results show that our proposed method is feasible and efficient.
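The abstract leans on the vector space model for node similarity. As a generic, hedged sketch (the paper's actual tokenization and term weighting are not shown), node similarity can be computed as the cosine between term-frequency vectors:

```python
# Generic VSM sketch: represent two text nodes as term-frequency vectors
# and compare them with cosine similarity. Tokenization is simplified.
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * \
           sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

print(cosine("deep web data extraction", "web data records"))  # ~0.58
```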
... A rule-based method understands a query interface using a set of manually specified rules [Dragut et al. 2009; He et al. 2007; Kaljuvee et al. 2001; Raghavan and Garcia-Molina 2001; Shestakov et al. 2005; Zhang et al. 2004]. ...
... In He et al. [2007], the HTML structure is used to associate elements and labels. The textual layout of a query interface is represented as an interface expression, which consists of three kinds of items, t, e, and |, where t represents a label, e represents an element, and | denotes a new-row HTML tag, such as <p> or <br>. ...
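A hedged sketch of building that interface expression is shown below. The token representation is invented for illustration; a real implementation would walk the parsed DOM of the form.

```python
# Hedged sketch of producing an interface expression (IEXP): emit 't' for a
# label text, 'e' for an input element, and '|' for a new-row tag.
NEW_ROW_TAGS = {"p", "br", "tr"}   # "tr" is an assumed addition

def to_iexp(tokens):
    out = []
    for kind, value in tokens:     # e.g. ("tag", "br"), ("text", "Title:"),
        if kind == "tag" and value in NEW_ROW_TAGS:   # ("element", "input")
            out.append("|")
        elif kind == "text" and value.strip():
            out.append("t")
        elif kind == "element":
            out.append("e")
    return "".join(out)

print(to_iexp([("text", "Title:"), ("element", "input"), ("tag", "br"),
               ("text", "Author:"), ("element", "input")]))  # -> "te|te"
```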
Article
Full-text available
Users submit queries to an online database via its query interface. Query interface parsing, which is important for many applications, understands the query capabilities of a query interface. Since most query interfaces are organized hierarchically, we present a novel query interface parsing method, StatParser (Statistical Parser), to automatically extract the hierarchical query capabilities of query interfaces. StatParser automatically learns from a set of parsed query interfaces and parses new query interfaces. StatParser starts from a small grammar and enhances the grammar with a set of probabilities learned from parsed query interfaces under the maximum-entropy principle. Given a new query interface, the probability-enhanced grammar identifies the parse tree with the largest global probability to be the query capabilities of the query interface. Experimental results show that StatParser very accurately extracts the query capabilities and can effectively overcome the problems of existing query interface parsers.
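As a hedged sketch of the scoring idea only (StatParser's grammar, features, and maximum-entropy estimation are far richer), one can pick the candidate parse tree whose summed log rule probabilities are largest. The tree and rule encodings below are invented for illustration.

```python
# Hedged sketch: choose the candidate parse tree with the largest global
# probability under learned rule probabilities. A tree is (rule, children);
# a leaf has an empty child list.
def tree_logprob(tree, rule_logprob):
    rule, children = tree
    return rule_logprob[rule] + sum(tree_logprob(c, rule_logprob)
                                    for c in children)

def best_parse(candidates, rule_logprob):
    return max(candidates, key=lambda t: tree_logprob(t, rule_logprob))
```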
... Form understanding has attracted a number of approaches motivated by deep web search [21,28,29], meta-search engines and web form integration [16,11,32-34,36], and web extraction [30,31]. We focus here on differences to OPAL; for a complete survey see [19,12]. ...
... (1) The most common type encodes (mostly domain-independent) observations on typical forms into implicit heuristics or explicit rules: MetaQuerier [9,36], ExQ [33], SchemaTree [11], LITE [28], Wise-iExtractor [16], DEQUE [29], and CombMatch [17]. (2) Alternatively, some approaches, LabelEx [24] and HMM [18], use machine learning from a set of example forms (possibly of a specific domain). ...
... Wise-iExtractor [16] first tokenizes the form to obtain a high-level visual layout description (an interface expression (IEXP)), distinguishing text fragments, form fields, and delimiters such as line breaks. It then associates texts and fields by computing an association weight between any given field and the texts in the same line and the two preceding lines, exploiting ending colons, similarities between the text and the field's HTML name attribute, and the text-field distance. ...
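A hedged sketch of that association step follows; the weights below are invented, and the real formula is in [16].

```python
# Hedged sketch of Wise-iExtractor-style text/field association: score each
# candidate text by an ending colon, similarity to the field's HTML name
# attribute, and its distance from the field.
from difflib import SequenceMatcher

def association_weight(text, field_name, distance):
    w = 1.0 if text.rstrip().endswith(":") else 0.0       # colon bonus
    w += SequenceMatcher(None, text.strip(": ").lower(),
                         field_name.lower()).ratio()      # name similarity
    return w - 0.1 * distance                             # nearer is better

def associate(field_name, candidates):
    # candidates: (text, distance) pairs drawn from the same line and the
    # two preceding lines, as the excerpt describes.
    return max(candidates,
               key=lambda c: association_weight(c[0], field_name, c[1]))[0]
```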
Article
Full-text available
Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding provides applications, ranging from crawlers over meta-search engines to service integrators, with a key to this content. Yet, it has received little attention other than as component in specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data. In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines features from the text, structure, and visual rendering of a web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches for form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect (more than 97 percent accuracy in the evaluation domains) form interpretations. Yet, the effort to produce a domain schema is very low, as we provide a Datalog-based template language that eases the specification of such schemata and a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of the form interpretations in OPAL through a light-weight form integration system that successfully translates and distributes master queries to hundreds of forms with no error, yet is implemented with only a handful translation rules.
... It is likely that scientists spend more time searching within an RNS to perform deep or vertical searches, i.e., domain-specific searches seeking detailed information on fellow scientists' grants and publications. [24-26] We find that locating a scientist's profile takes less time (3.5 seconds) than a generic navigational task on Google (34.2 seconds), whereas browsing through organization pages takes longer (117 seconds). The navigational tasks described in the Google study measured the time for a user to perform tasks such as "find the home page of Michael Jordan." ...
Article
Usage data for research networking systems (RNSs) are valuable but generally unavailable for understanding scientific professionals' information needs and online collaborator-seeking behaviors. This study contributes a method for evaluating RNSs and initial usage knowledge of one RNS obtained using this method. We designed a log for an institutional RNS, defined categories of users and tasks, and analyzed correlations between usage patterns and user and query types. Our results show that scientific professionals spend more time performing deep Web searching on RNSs than generic Google users, that retrieving scientist profiles is faster on an RNS than on Google (3.5 seconds vs. 34.2 seconds), and that organization-specific browsing on an RNS takes longer than on Google (117.0 seconds vs. 34.2 seconds). Usage patterns vary by user role; e.g., faculty performed more informational queries than administrators, which implies that role-specific user support is needed for RNSs.
... There are many form modeling proposals, ranging from simple models that just keep a record of all the fields to more complex models that add semantics to each field by analysing field tags [2], [10], [12], [18] and surrounding text, identifying mandatory fields [19], or identifying relationships between fields [10]. ...
Conference Paper
Full-text available
Virtual Integration systems require a crawling tool able to navigate and reach relevant pages on the Web in an efficient way. Existing proposals in the crawling area are aware of the efficiency problem, but most of them still need to download pages in order to classify them as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. Such a crawler is able to choose at each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, optimising bandwidth and making it efficient and suitable for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages, without prior intervention from the user. Keywords: Crawlers, Web Navigation, Virtual Integration
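A hedged sketch of the URL-only idea follows. The tokenizer, example URLs, and the use of clustering are invented for illustration; the paper's own classifier is not reproduced, but clustering mirrors its claim of needing no prior intervention from the user.

```python
# Hedged sketch of a URL-only page classifier: pages are never downloaded;
# URLs are tokenized and grouped purely by their tokens.
import re
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def url_tokens(url):
    return " ".join(re.split(r"[/\.\?=&\-_:]+", url.lower()))

urls = ["http://example.com/books/search?title=a",
        "http://example.com/books/search?title=b",
        "http://example.com/about/contact",
        "http://example.com/about/jobs"]
X = CountVectorizer().fit_transform(url_tokens(u) for u in urls)
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))  # two link groups
```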
... Regarding form filling, a search form model is needed to give semantics to search forms, which are designed by and for users. Deep-web approaches use different types of search form models [2], [36], [43], [55], [65], [75]. The first step in generating a search form model is to identify labels, i.e., text strings that give users an intuition about the semantics of a form field [2], [36], [39], [43], [55], [58], [65], [75]. ...
... Deep-web approaches use different types of search form models [2], [36], [43], [55], [65], [75]. The first step in generating a search form model is to identify labels, i.e., text strings that give users an intuition about the semantics of a form field [2], [36], [39], [43], [55], [58], [65], [75]. There are three different approaches to identifying form field labels automatically; they rely on the idea that label positions in a search form carry significant semantic information. ...
... In textual identification [36], [39], [43], the HTML code of a search form is used to extract field labels. These techniques rely on the idea that analysing HTML code approximately captures the visual layout. ...
Conference Paper
Full-text available
The actual value of the Deep Web comes from integrating the data its applications provide. Such applications offer human-oriented search forms as their entry points, and there exist a number of tools that are used to fill them in and retrieve the resulting pages programmatically. Solutions that rely on these tools are usually costly, which has motivated a number of researchers to work on virtual integration, also known as metasearch. Virtual integration abstracts away from actual search forms by providing a unified search form: a programmer fills it in and the virtual integration system translates the query into the application search forms. We argue that virtual integration costs might be reduced further if another abstraction level is provided by issuing structured queries in high-level languages such as SQL, XQuery or SPARQL; this helps abstract away from search forms. As far as we know, there is no proposal in the literature that addresses this problem. In this paper, we propose a reference framework called IntegraWeb to solve the problems of using high-level structured queries to perform deep-web data integration. Furthermore, we provide a comprehensive report on existing proposals from the database integration and Deep Web research fields, which can be used in combination to address our problem within the proposed reference framework. Index Terms: Internet and emerging technologies; Semantic Web.
... Usually, forms contain many small text sections (labels) consisting of only a few words. Unlike wrapping approaches, we have no regularities to exploit beyond the implicit conventions of form layout [15, 3], as we deal with the data forms (schemas) instead of the data they can generate (instances). In other words, we face a more heterogeneous scenario where patterns are scarce. ...
... These techniques analyse visual influence areas (which allow us to find which labels and controls fall under a higher-level heading) and positional relations (adjacency and nesting) to find out relations between labels and controls, and in particular, which controls are organized as a grid, described by row and column headers. Alternative approaches can be found in [15, 3, 9, 10]. 4. Labels are then analysed using simple techniques to find out workflow relations. ...
Article
Full-text available
This paper presents a method for semi-automatically building tailored application ontologies from a set of data acquisition forms. Such ontologies are intended to facilitate the integration of very heterogeneous data generation processes and their linkage to well-known external resources. The resulting tool is being applied to the medical domain, where a wide variety of knowledge and linguistic resources are available. The proposed method consists of first inferring the implicit structure of the forms and then semantically annotating all their textual elements. Finally, by applying a set of patterns over the form inferred structure, the tool generates the ontology axioms that describe it. Our initial results demonstrate that the approach can perform effectively.