Ron Sacks-Davis's research while affiliated with RMIT University and other places

Filtered Document Retrieval with Frequency-Sorted Indexes

Queries to text collections are resolved by ranking the documents in the collection and returning the highest-scoring documents to the user. An alternative retrieval method is to rank passages, that is, short fragments of documents, a strategy that can improve effectiveness and identify relevant material in documents that are too large for users to consider as a whole. However, ranking of passages can considerably increase retrieval costs. In this paper we explore alternative query evaluation techniques, and develop new techniques for evaluating queries on passages. We show experimentally that, appropriately implemented, effective passage retrieval is practical in limited memory on a desktop machine. Compared to passage ranking with adaptations of current document ranking algorithms, our new "DO-TOS" passage ranking algorithm requires only a fraction of the resources, at the cost of a small loss of effectiveness.

Article

July 2002

57 Reads

205 Citations

Journal of the American Society for Information Science

Michael Persin

Storage Management for Files of Dynamic Records

Ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. We propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. CPU time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. The principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces CPU time and disk traffic to around one third of the original requirement. We also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed.
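
The frequency-sorted list idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the scoring (a plain sum of within-document frequencies), the fixed `min_freq` cutoff, and all function names are assumptions made for illustration. The point it shows is that, once a list is sorted by decreasing frequency, processing can stop at the first posting below the cutoff.

```python
from collections import defaultdict

def build_frequency_sorted_index(docs):
    """Map each term to (doc_id, freq) postings sorted by decreasing freq."""
    counts = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            counts[term][doc_id] = counts[term].get(doc_id, 0) + 1
    return {
        term: sorted(postings.items(), key=lambda p: (-p[1], p[0]))
        for term, postings in counts.items()
    }

def rank(index, query_terms, min_freq=1, top_k=3):
    """Sum frequencies per document, abandoning a list below min_freq.

    min_freq is a hypothetical fixed cutoff; a real system would derive
    its threshold from the scores accumulated so far.
    """
    accumulators = defaultdict(int)
    for term in query_terms:
        for doc_id, freq in index.get(term, []):
            if freq < min_freq:
                break  # all later postings have equal or lower frequency
            accumulators[doc_id] += freq
    ranked = sorted(accumulators.items(), key=lambda a: (-a[1], a[0]))
    return ranked[:top_k]

docs = {
    1: "index index index compression",
    2: "index ranking ranking",
    3: "ranking index",
}
index = build_frequency_sorted_index(docs)
print(rank(index, ["index", "ranking"], min_freq=2))
```

With `min_freq=2`, only one posting per list is read here, so fewer accumulators are created than with `min_freq=1`, which is the kind of memory saving the abstract refers to.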

Article

July 2002

22 Reads

13 Citations

Alistair Moffat

Indexing Documents for Queries on Structure, Content and Attributes

We propose a new scheme for managing files of variable-length dynamic records, based on storing the records in large, fixed-length blocks. We demonstrate the effectiveness of this scheme for text indexing, showing that it achieves space utilisation of over 95%. With an appropriate block size and caching strategy, our scheme requires on average around two disk accesses (one read and one write) for insertion of or change to a record.
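
As a rough illustration of storing variable-length records in fixed-length blocks, the sketch below keeps everything in memory and uses first-fit placement. The class name `BlockStore`, the first-fit policy, and the absence of per-record overhead are all assumptions; the abstract does not describe the paper's actual placement or caching strategy.

```python
class BlockStore:
    """Toy store: variable-length records packed into fixed-size blocks."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = []     # each block maps record_id -> bytes
        self.location = {}   # record_id -> block number

    def _used(self, block_no):
        return sum(len(data) for data in self.blocks[block_no].values())

    def put(self, record_id, data):
        """Insert or change a record, placing it in the first block with room."""
        if len(data) > self.block_size:
            raise ValueError("record larger than a block")
        if record_id in self.location:
            # A changed record is removed first and then re-placed.
            del self.blocks[self.location.pop(record_id)][record_id]
        for block_no in range(len(self.blocks)):
            if self._used(block_no) + len(data) <= self.block_size:
                break
        else:
            # No existing block has room: allocate a new one.
            self.blocks.append({})
            block_no = len(self.blocks) - 1
        self.blocks[block_no][record_id] = data
        self.location[record_id] = block_no

    def get(self, record_id):
        return self.blocks[self.location[record_id]][record_id]

    def utilisation(self):
        """Fraction of allocated block space that holds record data."""
        if not self.blocks:
            return 0.0
        used = sum(self._used(b) for b in range(len(self.blocks)))
        return used / (self.block_size * len(self.blocks))

store = BlockStore(block_size=16)
store.put("a", b"x" * 10)
store.put("b", b"y" * 10)
store.put("c", b"z" * 6)
print(store.location, store.utilisation())
```

In a disk-based version, each `put` would correspond to reading the chosen block and writing it back, which matches the one-read, one-write cost quoted in the abstract.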

Article

July 2002

58 Reads

35 Citations

Indexing and retrieval techniques for large text databases are well developed, but most of the techniques developed to date assume that the text to be indexed has little or no structure. With the growth in the use of sophisticated markup languages for text, a database system for structured documents should use, not just document content, but structural information and attributes, and should support queries on content, structure and attributes. In this paper we review and compare two recent approaches for accessing document collections. For one of the approaches, position-based indexing, queries are resolved by manipulating ranges of word offsets, while for the other, based on a path model, the position of a word is represented in terms of the structural components that enclose it. The former allows slightly smaller indexes; the latter allows more efficient query evaluation.
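
The contrast between the two approaches can be sketched on a tiny document. This is a simplified illustration, not either system from the paper: the function names are hypothetical, attributes are ignored, and text that follows a child element (tail text) is skipped. The position-based query compares word offsets against element ranges, while the path-based query inspects the path stored with each word.

```python
import xml.etree.ElementTree as ET

def build_indexes(xml_text):
    """Build a position-based index and a path-based index for one document."""
    root = ET.fromstring(xml_text)
    word_offsets = {}   # position-based: word -> list of word offsets
    ranges = {}         # position-based: tag -> list of (start, end), end exclusive
    word_paths = {}     # path-based: word -> list of enclosing element paths
    counter = 0

    def visit(elem, parent_path):
        nonlocal counter
        path = parent_path + "/" + elem.tag
        start = counter
        for word in (elem.text or "").split():
            word_offsets.setdefault(word, []).append(counter)
            word_paths.setdefault(word, []).append(path)
            counter += 1
        for child in elem:
            visit(child, path)  # tail text is ignored in this sketch
        ranges.setdefault(elem.tag, []).append((start, counter))

    visit(root, "")
    return word_offsets, ranges, word_paths

def find_by_position(word, tag, word_offsets, ranges):
    """Offsets of `word` that fall inside any range of element `tag`."""
    return [
        offset for offset in word_offsets.get(word, [])
        if any(start <= offset < end for start, end in ranges.get(tag, []))
    ]

def find_by_path(word, tag, word_paths):
    """Paths of `word` occurrences that pass through element `tag`."""
    return [p for p in word_paths.get(word, []) if tag in p.split("/")]

DOC = "<doc><title>text indexing</title><sec><p>structured text</p></sec></doc>"
offsets, ranges, paths = build_indexes(DOC)
print(find_by_position("text", "sec", offsets, ranges))
print(find_by_path("text", "sec", paths))
```

Both queries find the single occurrence of "text" inside the `sec` element; the position-based index stores only integers, whereas the path-based index answers the structural test without consulting a separate range table.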

Databases of Legislation: the Problems of Consolidations

Article

January 2002

36 Reads

2 Citations

System Architectures for Structured Document Data

We discuss a data model for the storage, retrieval and display of legislation in large database collections. Using free-text retrieval, the logical structure of SGML, and the browsing power of hypertext, arbitrary versions of statutes can be displayed, combining the power of traditional paper and current computer research tools.

November 2000

81 Reads

9 Citations

Acoustics, Speech, and Signal Processing Newsletter, IEEE

Michael Fuller

A Standards-Based Approach To Combining Information Retrieval And Database Functionality

Semi-structured data, including but not limited to structured documents, has specific characteristics and is used in ways different from tabular data. SGML and XML are widely used to represent information of this type. The demands on systems that manage semi-structured data vary from those on traditional relational systems. This paper reviews the nature and characteristics of semi-structured data, and the functional needs of those applications, including query requirements, document description, manipulation, and document management needs. It examines alternative physical models for semi-structured data, and evaluates and compares alternative system architectures.

Article

November 2000

12 Reads

4 Citations

Retrieval of Partial Documents

Alan Kent

This paper describes the SIM architecture and the techniques used to support these standards. It describes how a standards-based approach can be used to support structured queries and presents the techniques used by SIM for query evaluation.

Article

October 2000

33 Reads

42 Citations

Introduction Provision of answers to informally phrased questions is a central part of information retrieval. These answers traditionally take the form of documents retrieved from a text database, but documents will often be unsatisfactory as answers. They may be large and unwieldy; the answer they represent may be diffuse, and therefore hard for the user to extract; and word-based retrieval systems may be misled by the breadth of vocabulary of a long document into believing it to be relevant. Indexing and returning parts of documents addresses these problems. We have approached the problem of partial documents in two ways. The first approach is to regard documents as an unstructured series of "pages" of text of similar length, each of which can be returned as an answer to a query. We would expect, under this approach, that any bias in the retrieval mechanism towards documents of a particular length should be eliminated. By regarding an answer to be the document from which an

... ore complex and abstract it tends to be." It is therefore an advantage that SGML allows a custom DTD to be created for each unique need. Because SGML is so general, it is relatively easy to convert marked-up text to, for example, XML or HTML. This adaptability is a strength in many contexts. As shown in Table 2 above, Wilkinson et al. (1998) judge that PostScript and PDF require more space and are less flexible than SGML, HTML and XML. XML is considered best in terms of presentation flexibility, while PostScript and PDF have the highest presentation quality. ...
Reference:
Elektronisk publicering: vetenskapliga dokument med åtkomst via webben

Document Computing

Citing Book
January 1998

[...]

... During this process, the evaluation and adaptation of query languages for retrieving geometries (Frank 1982) and several proposals for indexing spatial data structures (e.g., Stonebraker et al. 1983, Guttman 1984) were also significant milestones. These works evolved into the Dual (Schilcher 1985, Ooi et al. 1989, Aref and Samet 1991) and Integrated architectures (Dayal et al. 1987). The latter represented a crucial instant in the development of spatial database architectures and resulted in several Spatial Database Management Systems (SDMS) such as PROBE (Orenstein 1986, Orenstein and Manola 1988) and POSTGRES (Stonebraker and Rowe 1986). ...
Reference:
A Survey of Modelling Trends in Temporal GIS

Extending a DBMS for Geographic Applications

Citing Conference Paper
February 1989

Beng Chin Ooi

Hardware address translation for machines with a large virtual memory

Ken J. McDonell

... Address translation hardware for virtual memory implementation is a widely used application of hashing. Ramamohanarao and Sacks-Davis gave a summary of the hardware implementation of page tables using hashing [7]. The one-level scheme was used in the IBM System/38 [8], [9] and in the IBM RT PC [10] with bit extraction and XOR hashing functions. ...
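
The two families of hashing function named in this snippet can be sketched generically. The functions below are illustrative only and are not the functions used in the IBM System/38 or IBM RT PC: the names, the choice of low-order bits for extraction, and the chunk-wise XOR folding are assumptions.

```python
def bit_extraction_hash(page_number, index_bits):
    """Use the low-order index_bits of the virtual page number as the slot."""
    return page_number & ((1 << index_bits) - 1)

def xor_fold_hash(page_number, index_bits):
    """XOR together successive index_bits-wide chunks of the page number."""
    mask = (1 << index_bits) - 1
    slot = 0
    while page_number:
        slot ^= page_number & mask
        page_number >>= index_bits
    return slot

# Two page numbers that differ only in their high-order bits.
pages = [0x0100, 0x0200]
print([bit_extraction_hash(p, 8) for p in pages])
print([xor_fold_hash(p, 8) for p in pages])
```

Bit extraction ignores the high-order bits, so these two pages land in the same slot, while XOR folding lets every bit of the page number influence the slot.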
Reference:
Efficient Hardware Hashing Functions for High Performance Computers

Citing Article
October 1981

Information Processing Letters

K. Ramamohanarao

R. Sacks-Davis

... [7]), a modified Newton-Raphson iteration is used. For this class of problem it is considered by some ([11,13,14]) that $\|J\|$ is a more suitable parameter than $\rho$ for selecting algorithms. It can also be useful in estimating the condition number ($\|B\| \cdot \|B^{-1}\|$) of the characteristic matrix $[I - hbJ]$ of Rosenbrock methods [15]. ...
Reference:
Run time estimation of the spectral radius of Jacobians

A type-insensitive ODE code based on second derivative formulas

Citing Article
December 1981

Computers & Mathematics with Applications

R Sacks-Davis

... If they are stored in memory, a typical query of 25 terms will require that 25 inverted file entries be accessed from disk before the query can be processed. If, to conserve memory space, the vocabulary and associated information are merged on disk with the inverted file entries and some form of hashing (such as extensible hashing [14]) is used, the expected number of disk accesses can be kept to about 1.2 on average per query term, but at the cost of a 20%–30% expansion in the size of the inverted file. A more practical solution is to allow two accesses per query—the first into the index file containing the vocabulary and term information, including the inverted file entry address, and the second to actually retrieve the inverted file entry. ...
Reference:
Memory efficient ranking

Recursive Linear Hashing.

Citing Article
September 1984

ACM Transactions on Database Systems

Kotagiri Ramamohanarao

Fixed Leading Coefficient Implementation of SD-Formulas for Stiff ODEs

... The initial species concentrations are indicated by the column vector $y_0$. Numerical solutions for stiff ODE systems defined by Equation (1) can be obtained using explicit or implicit ODE integrators [4][5][6][7][8]. Many ODEs have been used for chemical kinetic models; however, they are stiff [9], $y' = f(t, y)$, $0 \le t \le T$ (1) ...
Reference:
Variable Step Block Hybrid Method for Stiff Chemical Kinetics Problems

Citing Article
December 1980

ACM Transactions on Mathematical Software

A signature file scheme based on multiple organizations for indexing very large text databases

... While some work into curating collections for use in evaluating near duplication detection exists [5][6][7][18], such corpora have primarily focused on document collections that do not reflect our problem domain. Finally, we are not concerned in this work with retrieval of documents using their signatures and leave investigating the applicability of such methods [2,3,12,14] to future work. ...
Reference:
On Tradeoffs Between Document Signature Methods for a Legal Due Diligence Corpus

Citing Article
October 1990

Journal of the American Society for Information Science

Alan J. Kent

Querying in a Large Hyperbase

Kotagiri Ramamohanarao

... It is recognized that both navigational and associative access to the database are important. This complies with experience from systems where large complex databases are to be handled, e.g., [15]. Our interface provides very flexible functionality to navigate through relationships, returning individual relationships or (sets of) objects which are related to a specific object. ...
Reference:
A C++ Database Interface Based on the Entity-Relationship Approach.

Citing Conference Paper
January 1991

[...]

... The way to store the structure of documents in the legal domain has evolved with the appearance of new standards applicable to structured documents. Those approaches where the structure is kept separately from the document content [38,3], have given way to those where the structure forms part of the text of the document [56,12,65,55]. XML allows a document to be tagged according to its semantic structure, and provides additional standards and utilities to access (XPath) and manipulate document components in XML documents (XSLT). ...
Reference:
Principes d'exploitation dynamique des relations inter-documents dans les bibliothèques électroniques : application au domaine juridique

Managing a Digital Library of Legislation.

Citing Conference Paper
January 1997

Phil Anderson

Efficiency of Nested Relational Document Database Systems.

... Niemi and Järvelin & Niemi 1999), like several other authors (e.g. Sacks-Davis et al., 1995; Zobel et al., 1991; Lambrix & Padgham, 2000), have proposed complex entities for representing and manipulating hierarchical documents. Järvelin and others (2000) have shown that complex entities are natural structures for informetrics. ...
Reference:
Advanced query language for manipulating complex entities

Citing Conference Paper
January 1991

James A. Thom