Article

Text Compression Using a 4 Bit Coding Scheme.

Abstract

The most frequently used words in natural or printed English are found, unexpectedly, to contain only an average proportion of the most frequently used letters. This independence of the word and letter frequency distributions is used to minimize the number of bits necessary to code natural English text. It is shown that mean bit rates of less than 4 per character can be achieved for text using the full ASCII set of 96 characters, by combining a variable bit length representation of each character with a character combination dictionary of 100 or more common words. A simple practical scheme is presented which uses 4, 8 or 12 bits to code the characters and dictionary words. Using this scheme with a 205 word dictionary, a mean code rate of 3.87 bits per character is achieved. It is indicated how even this rate might be improved with a larger dictionary or by basing the dictionary on the more numerous word prefixes. 9 refs.
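
The abstract does not spell out the exact code layout, so the following Python sketch only illustrates the general idea of a 4/8/12-bit escape scheme: a handful of very common symbols take a single 4-bit nibble, two nibble values are reserved as escapes into longer codes, and the tables COMMON, tier8 and tier12 are placeholders standing in for the frequency-ordered character and dictionary-word assignments.

```python
# Hypothetical illustration of a 4/8/12-bit escape coding scheme; the exact
# tables and layout used in the paper are not given in the abstract.

COMMON = " etaoinshrdlu"        # assumed 4-bit tier: 13 frequent symbols
ESCAPE_8, ESCAPE_12 = 14, 15    # reserved nibble values that extend the code

def encode(tokens, tier8, tier12):
    """Code a stream of characters and dictionary-word tokens as 4-bit nibbles.

    tier8 maps medium-frequency symbols to 0..15 (8-bit codes in total),
    tier12 maps everything else, including dictionary words, to 0..255
    (12-bit codes in total). Both tables are placeholders.
    """
    nibbles = []
    for t in tokens:
        if t in COMMON:                        # 4-bit code
            nibbles.append(COMMON.index(t))
        elif t in tier8:                       # escape + 4 bits = 8 bits
            nibbles += [ESCAPE_8, tier8[t]]
        else:                                  # escape + 8 bits = 12 bits
            code = tier12[t]
            nibbles += [ESCAPE_12, code >> 4, code & 0xF]
    return nibbles
```

With the most frequent characters and the dictionary words placed in the cheaper tiers, the average cost per character can fall below 4 bits, which is the effect the abstract reports.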

... Some early compression algorithms were non-adaptive, using a fixed-dictionary mechanism. For example, Pike [9] used a fixed dictionary of 205 popular English words and a variable length coding mechanism to compress typical English text at a rate of less than 4 bits per character. Another recent algorithm, Smaz [10], similarly uses a fixed dictionary consisting of common digrams and trigrams from English and HTML source code, allowing it to compress even very short strings. ...
Conference Paper
Compression is desirable for network applications as it saves bandwidth. However, when data is compressed before being encrypted, the amount of compression leaks information about the amount of redundancy in the plaintext. This side channel has led to the “Browser Reconnaissance and Exfiltration via Adaptive Compression of Hypertext” (BREACH) attack on web traffic protected by the TLS protocol. The general guidance for preventing this attack is to disable HTTP compression, preserving confidentiality but sacrificing bandwidth. As a more sophisticated countermeasure, fixed-dictionary compression was introduced in 2015, enabling compression while protecting high-value secrets, such as cookies, from attack. The fixed-dictionary compression method is a cryptographically sound countermeasure against the BREACH attack, since it is proven secure in a suitable security model. In this project, we integrate the fixed-dictionary compression method as a BREACH countermeasure in a real-world client-server setting. Further, we measure the performance of the fixed-dictionary compression algorithm against the DEFLATE compression algorithm. The results show that it is possible to save some bandwidth, with reasonable compression/decompression times compared to DEFLATE. The countermeasure is easy to implement and deploy; hence, it is a practical way to mitigate the BREACH attack efficiently, rather than disabling HTTP compression entirely.
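
As a rough illustration of how a preset dictionary can be supplied to a DEFLATE implementation, the sketch below uses Python's zlib zdict facility with an invented dictionary string. This is not the countermeasure evaluated in the paper: zdict still allows adaptive matching within the message body, which is exactly what the fixed-dictionary-only scheme avoids; the sketch only shows the mechanics of seeding a dictionary.

```python
import zlib

# Hypothetical preset dictionary of strings expected to recur in responses.
FIXED_DICT = b"HTTP/1.1 200 OK Content-Type: text/html Set-Cookie: session="

def compress_with_preset(data: bytes) -> bytes:
    co = zlib.compressobj(level=6, zdict=FIXED_DICT)
    return co.compress(data) + co.flush()

def decompress_with_preset(blob: bytes) -> bytes:
    dco = zlib.decompressobj(zdict=FIXED_DICT)
    return dco.decompress(blob) + dco.flush()

body = b"HTTP/1.1 200 OK Content-Type: text/html <html>hello</html>"
packed = compress_with_preset(body)
assert decompress_with_preset(packed) == body
print(len(packed), "bytes with preset dictionary vs",
      len(zlib.compress(body, 6)), "bytes with plain DEFLATE")
```
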
... The compression achieved by digram coding can be improved by generalizing it to "n-grams", fragments of n consecutive characters [Pike 1981; Tropper 1982]. The problem with a static n-gram scheme is that the choice of phrases for the dictionary is critical and depends on the nature of the text being encoded, yet we want phrases to be as long as possible. ...
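
As a toy sketch of the static n-gram idea (the fragment table below is invented, not taken from either cited paper): a fixed list of frequent fragments is matched left to right in the order listed, and each hit is replaced by its index.

```python
# Hypothetical fragment table; a real one would be chosen from corpus statistics.
NGRAMS = ["the ", "and ", "ing ", "tion", "th", "he", "in", "er"]

def ngram_encode(text):
    """Replace known fragments by ('N', index) tokens, literals by ('C', char)."""
    out, i = [], 0
    while i < len(text):
        for idx, frag in enumerate(NGRAMS):
            if text.startswith(frag, i):
                out.append(("N", idx))
                i += len(frag)
                break
        else:
            out.append(("C", text[i]))
            i += 1
    return out

print(ngram_encode("the thing"))   # [('N', 0), ('N', 4), ('N', 6), ('C', 'g')]
```
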
Article
Full-text available
The best schemes for text compression use large models to help them predict which characters will come next. The actual next characters are coded with respect to the prediction, resulting in compression of information. Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling that are suitable for use in practical text compression systems. The strategies fall into three main classes: finite-context modeling, in which the last few characters are used to condition the probability distribution for the next one; finite-state modeling, in which the distribution is conditioned by the current state (and which subsumes finite-context modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.
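
A minimal sketch of the first class, finite-context modelling, assuming an order-1 model with add-one smoothing (choices of my own, not prescribed by the survey). The probabilities produced here would normally drive an arithmetic coder, which is omitted.

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[prev][ch]

def probability(prev, ch, alphabet_size=256):
    ctx = counts[prev]
    # add-one smoothing so unseen characters never get zero probability
    return (ctx[ch] + 1) / (sum(ctx.values()) + alphabet_size)

def update(prev, ch):
    counts[prev][ch] += 1

prev = "\0"
for ch in "abracadabra":
    p = probability(prev, ch)    # would be handed to the entropy coder here
    update(prev, ch)             # model adapts after coding each character
    prev = ch
```
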
Conference Paper
Compression is desirable for network applications as it saves bandwidth; however, when data is compressed before being encrypted, the amount of compression leaks information about the amount of redundancy in the plaintext. This side channel has led to successful CRIME and BREACH attacks on web traffic protected by the Transport Layer Security (TLS) protocol. The general guidance in light of these attacks has been to disable compression, preserving confidentiality but sacrificing bandwidth. In this paper, we examine two techniques—heuristic separation of secrets and fixed-dictionary compression—for enabling compression while protecting high-value secrets, such as cookies, from attack. We model the security offered by these techniques and report on the amount of compressibility that they can achieve.
Article
Text compression methods can be divided into two classes: symbolwise and parsing. Symbolwise methods assign codes to individual symbols, while parsing methods assign codes to groups of consecutive symbols (phrases). The set of phrases available to a parsing method is referred to as a dictionary. The vast majority of parsing methods in the literature use greedy parsing (including nearly all variations of the popular Ziv-Lempel methods). When greedy parsing is used, the coder processes a string from left to right, at each step encoding as many symbols as possible with a phrase from the dictionary. This parsing strategy is not optimal, but an optimal method cannot guarantee a bounded coding delay. An important problem in compression research has been to establish the relationship between symbolwise methods and parsing methods. This paper extends prior work that shows that there are symbolwise methods that simulate a subset of greedy parsing methods. We provide a more general algorithm that takes any nonadaptive greedy parsing method and constructs a symbolwise method that achieves exactly the same compression. Combined with the existence of symbolwise equivalents for two of the most significant adaptive parsing methods, this result gives added weight to the idea that research aimed at increasing compression should concentrate on symbolwise methods, while parsing methods should be chosen for speed or temporary storage considerations.
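
For concreteness, a small sketch of greedy parsing against a static phrase dictionary (the phrase set is invented for illustration): at each step the longest matching phrase is emitted, otherwise a single-character literal.

```python
PHRASES = {"the", "there", "here", "her", "he"}   # hypothetical dictionary
MAXLEN = max(len(p) for p in PHRASES)

def greedy_parse(text):
    out, i = [], 0
    while i < len(text):
        for length in range(min(MAXLEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in PHRASES:
                out.append(candidate)
                i += length
                break
        else:
            out.append(text[i])   # literal: no phrase matches here
            i += 1
    return out

print(greedy_parse("theretheher"))   # ['there', 'the', 'her']
```

The parser commits to the longest match at each step without lookahead, which is the sense in which, as the abstract notes, greedy parsing is not optimal in general.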
Article
An industrial case study of data compression within the library automation domain is described. A context dependent approach, where the individual records require file-independent compression and expansion, is evaluated. The discussed approach favorably compares against popular compression algorithms. Comparisons were made against commercially available implementations of the conventional compression schemes. The described approach is now in use by The Library Corporation.
Article
There has been an unparalleled explosion of textual information flow over the internet through electronic mail, web browsing, digital library and information retrieval systems, etc. Since there is a persistent increase in the amount of data that needs to be transmitted or archived, the importance of data compression is likely to increase in the near future. Virtually all modern compression methods are adaptive models and generate variable-bit-length codes that must be decoded sequentially from beginning to end. If there is any error during transmission, the entire file cannot be retrieved safely. In this article we propose a few fault-tolerant methods of text compression that allow decoding to begin at any part of the compressed file, not necessarily at the beginning. If any sequence of one or more bytes is changed during transmission of the compressed file, the remaining data can be retrieved safely. These algorithms also support reversible decompression.
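
One simple way to get this kind of behaviour (a sketch of the general idea only, not necessarily the specific methods proposed in the article) is to compress fixed-size blocks independently, so that decoding can start at any block boundary and a damaged block does not affect the others.

```python
import zlib

BLOCK = 4096   # block size is an arbitrary choice for the sketch

def compress_blocks(data: bytes):
    """Compress each block independently so blocks can be decoded in isolation."""
    return [zlib.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def decompress_from(blocks, start_block: int) -> bytes:
    """Resume decoding at an arbitrary block, skipping anything before it."""
    return b"".join(zlib.decompress(b) for b in blocks[start_block:])
```
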
Article
Full-text available
The problem of noiselessly encoding a message when prior statistics are known is considered. The close relationship of arithmetic and enumerative coding for this problem is shown by computing explicit arithmetic coding probabilities for various enumerative coding examples. This enables a comparison to be made of the coding efficiency of Markov models and enumerative codes as well as a new coding scheme intermediate between the two. These codes are then extended to messages whose statistics are not known a priori. Two adaptive codes are described for this problem whose coding efficiency is upper-bounded by the extended enumerative codes. On some practical examples the adaptive codes perform significantly better than the nonadaptive ones.
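
As a small worked example of the enumerative side of this comparison (my own illustration, not one of the paper's examples): a binary message whose number of ones is known in advance can be replaced by its lexicographic rank among all sequences of that length and weight, which costs about log2 C(n, k) bits.

```python
from math import comb, ceil, log2

def enumerative_rank(bits):
    """Lexicographic rank of `bits` among sequences of the same length and weight."""
    rank, ones_left = 0, sum(bits)
    for i, b in enumerate(bits):
        if b == 1:
            # sequences that place a 0 here instead (same prefix) come first
            rank += comb(len(bits) - i - 1, ones_left)
            ones_left -= 1
    return rank

msg = [0, 1, 1, 0, 1, 0, 0, 1]               # n = 8, k = 4
print(enumerative_rank(msg), "of", comb(8, 4),
      "->", ceil(log2(comb(8, 4))), "bits")  # 70 patterns, 7 bits
```
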
Article
Keyword-based text retrieval engines have been and will continue to be essential to text-based information access systems because they serve as the basic building blocks of high-level text analysis systems. Traditionally, text compression and text retrieval are treated as independent problems. Text files are compressed and indexed separately. To answer a keyword-based query, text files are first uncompressed, and then searched sequentially or via an inverted index. This paper describes the design, implementation and evaluation of a novel integrated text compression and indexing scheme called ITCI, which combines the dictionary data structures for compression and indexing, and allows direct search through compressed text. The performance results show that ITCI's compression efficiency is within 7% to 17% of GZIP, which is among the best lossless data compression algorithms. The sum of the compressed text and the inverted index is only between 55% and 76% of the original text size, while ...
Article
Keyword based search engines are the basic building block of text retrieval systems. Higher level systems like content sensitive search engines and knowledge-based systems still rely on keyword search as the underlying text retrieval mechanism. With the explosive growth in content, Internet and Intranet information repositories require efficient mechanisms to store as well as index data. In this paper we discuss the implementation of the Shrink and Search Engine (SASE) framework, which unites text compression and indexing to maximize keyword search performance while reducing storage cost. SASE features the novel capability of being able to directly search through compressed text without explicit decompression. The implementation includes a search server architecture, which can be accessed from a Java front-end to perform keyword search on the Internet. The performance results show that the compression efficiency of SASE is within 7-17% of GZIP, one of the best lossless compression schemes...
Article
Considers the general problem of grouping a collection of objects of known frequencies in order to balance the frequencies of the resulting sets, and presents mathematical criteria for increasing the balance of a grouping by rearranging the sets. This leads to a method for monotonically increasing the relative entropy of the collection of objects by a sequence of multiway splitting or coalescing steps. The theory is applied to the threshold method used by Lynch to generate equifrequent sets. It is shown that, for typical distributions, some steps in the threshold method will decrease the balance and, furthermore, for some distributions the threshold method will give very poor results.
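
The following is a rough sketch of one plausible reading of a threshold-style grouping (not Lynch's exact algorithm): walk the frequency-ordered objects and close a group whenever the running total crosses the next multiple of total/k. With skewed distributions the early groups can overshoot badly, which is the kind of imbalance the article analyses.

```python
def threshold_groups(freqs, k):
    """Split object indices into k groups, closing a group at each frequency threshold."""
    total = sum(freqs)
    step = total / k
    groups, current, running = [], [], 0
    for i, f in enumerate(freqs):
        current.append(i)
        running += f
        if running >= step * (len(groups) + 1) and len(groups) < k - 1:
            groups.append(current)
            current = []
    groups.append(current)
    return groups

# Example: a skewed distribution gives group frequencies 50, 25 and 25.
print(threshold_groups([30, 20, 15, 10, 8, 7, 5, 5], 3))
```
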
Article
A computer program, developed as a psychological model of speech segmentation, is presented as a method of recoding natural language for economical storage or transmission. The program builds a dictionary of frequently occurring letter strings. Where these strings occur in a text they may be replaced by a short code, thus effecting a compression of up to 49%. The strings may also be used as key 'words' in a document retrieval system. The method has the particular merit of simplicity in building the dictionary and efficiency in encoding data.
Article
Text compression, using a coding dictionary of 200-plus n-grams, can halve file storage costs and double data transmission rates. However, software based text compression systems are slow and expensive in storage. Two hardware systems (based on a fixed record length and a byte-organized variable record length Associative Parallel Processor), for the compression and decompression of textual information, are described. Algorithms are given and their execution illustrated with practical examples. A feasibility study, comparing the performances and costs of the two systems with a conventional microprocessor (Digital LSI-11) implementation is also reported.
Article
An optimum method of coding an ensemble of messages consisting of a finite number of members is developed. A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
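
A compact Python sketch of this construction (the example frequencies are invented): the two least frequent items are repeatedly merged, and each symbol's code records the side taken at each merge.

```python
import heapq

def huffman_codes(freqs):
    """freqs: symbol -> frequency; returns symbol -> bit string (minimum-redundancy code)."""
    # each heap entry: (total frequency, tie-breaker, partial code table)
    heap = [(f, n, {sym: ""}) for n, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees...
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))   # ...are merged
        counter += 1
    return heap[0][2]

print(huffman_codes({"e": 12, "t": 9, "a": 8, "o": 7, "q": 1}))
```
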
Article
The ability of a dictionary encoder to reduce the redundancy of printed English text is evaluated by simulation on a general-purpose digital computer. The dictionary encoder matches segments of the input text to entries of a stored dictionary which contains frequently occurring sequences of letters. The text is thus defined as the succession of code designations corresponding to the selected dictionary entries. Since, for a normal piece of text, fewer binary digits are needed to specify the code designations than the text itself, the encoding produces a compressed equivalent of the original input. In addition to evaluating encoder performance the simulator also collects language statistics which are used for optimization of the encoder logic and the dictionary entries. For a broad type of English language text (news dispatches prepared for newspaper publication) the number of binary digits required to represent a piece of text can be reduced by 50 percent when using a 1000-entry dictionary. While a better compression than 50 percent is theoretically possible it may be difficult to realize, but a compression of the input text to 60 to 70 percent of its original size appears to be easily realizable with a small dictionary.