1st International Conference on Advances in Science, Engineering and Robotics Technology 2019 (ICASERT 2019)
Efficient Compression Scheme for Large
Natural Text Using Zipf Distribution
Md. Ashiq Mahmood and K.M. Azharul Hasan
Department of Computer Science and Engineering
Khulna University of Engineering & Technology (KUET)
Khulna, Bangladesh
Email: ashiqmahmoodbipu@gmail.com, azhasan@gmail.com
Abstract— Data compression is the process of modifying, encoding or converting the bit structure of data so that it consumes less space. Character encoding is closely related to data compression: it represents each character by some encoding system, putting a sequence of characters into a specific format for efficient transmission or storage. Data compression covers a huge domain of applications, including data communication, data storage and database development. In this paper we propose a new and efficient compression algorithm for large natural-text datasets, called 5-Bit Compression (5BC), in which any character is encoded by 5 bits. The algorithm encodes any English or Bangla character with 5 bits using table lookup. The lookup table is constructed using the Zipf distribution, a discrete distribution of the commonly used characters in different languages. 8-bit characters are converted to 5 bits by splitting the characters into 7 sets held in a single table; a character's position in the table then encodes it uniquely in 5 bits. Text compressed by 5BC is reduced by more than 60% of its original size. The algorithm for decompressing back to the original data is also described. After the 5BC output string is produced, LZW and Huffman techniques compress it further. Our experimental results demonstrate promising performance.
Keywords— encoding; compression; decompression; 5-bit compression; compression ratio.
I. INTRODUCTION
Data compression is a widely used and important topic in information science because of the storage capacity and transfer speed limits of different systems [1][2]. Its evident benefits include reduced storage requirements and transmission bandwidth, encoding with fewer bits, shorter transmission times and effective channel utilization [3][5]. A good compression scheme that remaps text onto shorter bit sequences is therefore vital [1][2][5], and a new encoding system is valuable if it represents a data source as precisely as possible using the fewest bits while leaving the meaning unaltered [1][2][11]. In many cases it is also essential to restore the compressed form back to the original representation [6]. In this paper the Zipf distribution is used to construct the dictionary of characters. It derives from Zipf's law, an empirical law of mathematical statistics which refers to the fact that many kinds of data studied in the physical and social sciences can be approximated by a Zipf distribution. Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table [4]. Thus the most frequent word occurs approximately twice as often as the second most frequent word and three times as often as the third most frequent word. Data compression techniques are broadly classified as lossy and lossless [8]. A lossy technique separates essential from immaterial data, compresses the essential part losslessly and discards the rest; in effect it reduces the size of a document by removing superfluous text or data [1][11]. The proposed 5BC algorithm is lossless [1][2][8] and supports both forward and backward mapping. 5BC converts characters from 8 bits to 5 bits by splitting the characters into 7 sets held in a single table; the characters in this table, i.e. the dictionary, are arranged using the Zipf distribution [4][12]. Similar characters are kept together in the same set, since this produces shorter bit-code sequences. 5BC can compress data by more than 60% of the original size, and the scheme shows promising efficiency.
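To make the rank-frequency idea concrete, the following minimal Python sketch (the sample string is only a placeholder, not the corpus used in this paper) ranks the characters of a text by frequency, which is the ordering principle behind the 5BC lookup table; under Zipf's law the r-th ranked symbol occurs with frequency roughly proportional to 1/r.

from collections import Counter

def rank_by_frequency(text):
    """Return characters ordered from most to least frequent."""
    return [ch for ch, _ in Counter(text).most_common()]

sample = "this is a small sample of natural english text"  # placeholder corpus
ranking = rank_by_frequency(sample)
# Under Zipf's law the top-ranked character should occur roughly twice as
# often as the second-ranked one and three times as often as the third.
print(ranking[:5])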
II. RELATED WORKS
The Huffman technique is a lossless technique that relies on entropy [9]. It constructs a binary tree, and the codes are read off from paths in that tree. Huffman coding is a variable-length encoding, since the code lengths vary with symbol frequency; in contrast, our 5BC algorithm always works with exactly 5 bits. The rule used in this technique is to encode the more frequently occurring data with fewer bits. Huffman coding is used in JPEG files, and JPEG 2000 (JP2) is an image compression standard and coding system [13]. Huffman algorithms are classified as Static Huffman and Adaptive Huffman [9]. Static Huffman first computes the frequencies and then builds one tree used for both the forward and the backward mapping [9][15], whereas Adaptive Huffman builds up its trees while computing the frequencies on the fly, constructing a tree in each of the two running processes [14]. There are also compression techniques based on dictionaries rather than statistical structure [13][15]. A well-known dictionary-based technique is Lempel-Ziv-Welch (LZW) [9]. In this algorithm, a string αK is formed from a string α, found in the dictionary, followed by a character K; αK is then added to the dictionary [10], and the code of α replaces it in the output [5]. The dictionary holds dynamic information rather than being static [10]; it is not transmitted, because the same dictionary can be rebuilt from the compressed information at decoding time. It is constructed identically during compression and decompression and discarded once the compression or decompression process has ended [10]. Another technique, the LZ technique [5][13], is likewise popular for its simplicity and its effective compression ratios. Our technique instead uses a lookup table that remains static and does not change when new data arrives; the lookup table, i.e. the dictionary, stays fixed and can be used for compression and decompression whenever required.
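For reference, here is a minimal Python sketch of the textbook LZW loop described above. It is the generic algorithm, not necessarily the exact variant benchmarked in Section V: the dictionary is primed with single characters, the longest known match α is emitted as a code, and the new string αK is added.

def lzw_compress(data):
    """Textbook LZW: emit the code for the longest known string, then add it + next char."""
    dictionary = {chr(i): i for i in range(256)}   # primed with single characters
    next_code = 256
    alpha, out = "", []
    for k in data:
        if alpha + k in dictionary:
            alpha += k                             # grow the current match
        else:
            out.append(dictionary[alpha])          # emit code for alpha
            dictionary[alpha + k] = next_code      # add the new string alpha+K
            next_code += 1
            alpha = k
    if alpha:
        out.append(dictionary[alpha])
    return out

print(lzw_compress("abababab"))   # repeated patterns collapse to a few codes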
III. EFFICIENT COMPRESSION SCHEME FOR LARGE NATURAL TEXT
Generally, encoding a character in memory requires 8 bits of space [1][2]. We introduce a 5-bit character encoding scheme, called 5BC, that represents a character by 5 bits instead of 8. The algorithm works with all the ordinary printable characters of English and Bangla found on a normal keyboard. A lookup table, illustrated in TABLE I, is built to represent the characters in 5 bits. With 5 bits, 2^5 = 32 combinations are possible; 25 of those combinations are used to encode characters and the remaining 7 are used for set representation. The characters are split into 7 sets. In the lookup table,
• Bangla characters are placed in Set1, Set2 and Set3;
• English characters are placed in Set4, Set5, Set6 and Set7;
• 7 binary combinations are reserved for the 7 set codes.
Unicode takes 16 bits of memory to represent natural-language character sets, and the modified 5-bit characters are finally converted into 16-bit Unicode characters.
TABLE I. Lookup table for 5BC
[Columns Set1-Set3 hold the Bangla characters; their glyphs are not legible in this copy and are left out below.]

Serial no.  Decimal value  Binary value  Set4   Set5    Set6   Set7
1           0              00000         A      Space   1      X
2           1              00001         B      ,       2      Y
3           2              00010         C      a       3      Z
4           3              00011         D      b       4      x
5           4              00100         E      c       5      y
6           5              00101         F      d       6      z
7           6              00110         G      e       7      '
8           7              00111         H      f       8      !
9           8              01000         I      g       9      "
10          9              01001         J      h       0      #
11          10             01010         K      i       +      \
12          11             01011         L      j       -      ~
13          12             01100         M      k       *      ^
14          13             01101         N      l       /      |
15          14             01110         O      m       =      $
16          15             01111         P      n       (      :
17          16             10000         Q      o       )      ;
18          17             10001         R      p       {      _
19          18             10010         S      q       }      `
20          19             10011         T      r       <
21          20             10100         U      s       >
22          21             10101         V      t       [
23          22             10110         W      u       ]
24          23             10111         .      v       %
25          24             11000         ?      w       &
26          25             11001         Set-1 (set-change code)
27          26             11010         Set-2 (set-change code)
28          27             11011         Set-3 (set-change code)
29          28             11100         Set-4 (set-change code)
30          29             11101         Set-5 (set-change code)
31          30             11110         Set-6 (set-change code)
32          31             11111         Set-7 (set-change code)
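For illustration, the English half of TABLE I can be written down as plain data. The Python sketch below omits Set1-Set3, whose Bangla glyphs are not recoverable from this copy; the 5-bit values 25-31 act as the set-change codes.

# Positions 0-24 inside each set are the 5-bit character codes; the values
# 25-31 (binary 11001-11111) select Set1-Set7 respectively (code = 24 + set).
SET4 = list("ABCDEFGHIJKLMNOPQRSTUVW") + [".", "?"]
SET5 = [" ", ","] + list("abcdefghijklmnopqrstuvw")
SET6 = list("1234567890") + list("+-*/=(){}<>[]") + ["%", "&"]
SET7 = list("XYZxyz") + list("'!\"#\\~^|$:;_`")   # only 19 entries are used
SETS = {4: SET4, 5: SET5, 6: SET6, 7: SET7}       # Set1-Set3 (Bangla) omitted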
Algorithm 1: Forward Mapping
Input: Normal string S
Output: An encoded compressed string Sc
1: Insert a set-change code wherever the character set changes,
2: representing the string S as S'.
3: Using the lookup table,
4: replace every symbol of S' by its corresponding 5-bit representation.
5: Let the resulting bit string contain K bits.
6: if K % 16 != 0 then
7:    append m bits taken one by one from the bits of the last 5-bit set code,
8:    cycling through them until (K + m) % 16 = 0
9: end if
10: Divide the bit string into (K + m)/16 groups of 16 bits.
11: Form the compressed string Sc by replacing each 16-bit group with the
    corresponding Unicode character.
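A compact Python sketch of the forward mapping follows, reusing the SETS table above. The padding rule is our reading of steps 6-9: the bits of the current set-change code are appended one at a time until the length is a multiple of 16.

def encode_5bc(text):
    """Algorithm 1 sketch: 5-bit codes plus set changes, packed into 16-bit chars."""
    codes, current = [], None
    for ch in text:
        for k, members in SETS.items():
            if ch in members:
                if k != current:                  # set changed: emit code 24+k
                    codes.append(24 + k)
                    current = k
                codes.append(members.index(ch))   # position inside the set
                break
        else:
            raise ValueError(f"{ch!r} is not in the lookup table")
    bits = "".join(f"{c:05b}" for c in codes)
    i = 0
    while len(bits) % 16 != 0:                    # pad with bits of the last
        bits += f"{24 + current:05b}"[i % 5]      # set-change code, cyclically
        i += 1
    # Pack each 16-bit group into one Unicode character.  A production version
    # would also have to avoid the surrogate range U+D800-U+DFFF.
    return "".join(chr(int(bits[j:j + 16], 2)) for j in range(0, len(bits), 16))

print(encode_5bc("I am Bangladeshi"))

Running this on the text of Example 1 below reproduces its decimal sequence 28 8 29 0 2 14 0 28 1 29 … and its first two 16-bit groups.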
Example 1:
Original Text (Input):
I am Bangladeshi
Set Representation: Set4 I Set5 space am space Set4 B
Set5 angladeshi
Decimal Representation: 28 8 29 0 2 14 0 28 1 29 2 15
8 13 2 5 6 20 9 10
5 bit representation: 11100 01000 11101 00000 00010
01110 00000 11100 00001 ….…
16 Bit representation: 1110001000111010
0000000100111000 …………
Compressed String: 脻Ჟ䑨ꡉǐ廷
Algorithm 2: Backward Mapping
Input: Compressed string Sc
Output: The decoded original string S
1: Represent Sc by the corresponding 16-bit binary value of each of its
   characters from the Unicode table.
2: Let the resulting bit stream K' contain K bits.
3: To determine the 5-bit representation, read K' five bits at a time.
4: Store the first set code of K' in C.
5: Compare every 5-bit set code of K' with C.
6: if C is not repeated then
7:    a set change has occurred, so C is updated with the new 5-bit set code and
8:    this new C is compared with the next set code;
9:    when C is repeated, the remaining bit stream is cropped,
10:   including the bit code of the repeated C.
11: end if
12: From the remaining bit stream, take 5 bits at a time and map each group to
    a character of the current set in the lookup table.
13: After excluding the set codes, the original string is obtained.
Example 2:
Compressed String: 脻Ჟ䑨ꡉǐ廷
16 Bit representation: 1110001000111010 0000000100111000 …
After dividing it into 16-bit groups, the 5 bit representation is read off:
11100 --- 28 --- Set4
01000 --- 8 --- I
11101 --- 29 --- Set5
00000 --- 0 --- space …
Decompressed String:
Set4 I Set5 space am space Set4 B Set5 angladeshi
Original Text: I am Bangladeshi
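The backward mapping admits an equally short sketch (again reusing SETS and encode_5bc from above); a repeated set-change code signals the start of the padding, which is cropped as in steps 9-10 of Algorithm 2.

def decode_5bc(compressed):
    """Algorithm 2 sketch: unpack 16-bit chars and replay the 5-bit codes."""
    bits = "".join(f"{ord(ch):016b}" for ch in compressed)
    out, current = [], None
    for j in range(0, len(bits) - 4, 5):     # whole 5-bit groups only
        code = int(bits[j:j + 5], 2)
        if code >= 25:                       # a set-change code: Set-(code-24)
            if code - 24 == current:         # repeated set code marks padding
                break
            current = code - 24
        else:
            out.append(SETS[current][code])
    return "".join(out)

print(decode_5bc(encode_5bc("I am Bangladeshi")))   # -> I am Bangladeshi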
IV. ANALYTICAL ANALYSIS
In 5BC, each 8-bit character is represented by 5 bits, so 3 bits are saved per character and the baseline saving of 5BC is (8-5)/8 = 37.5%. After the 5-bit stream is packed into 16-bit Unicode characters and compressed further, the best saving exceeds 60%. To calculate the efficiency precisely we develop a theoretical analysis; the parameters considered are shown in TABLE II.
TABLE II. PARAMETERS FOR ANALYTICAL EVALUATION

Parameter   Description
I           Original data
E           Encoded data
Ls          Length of the set codes
Li          Length of I
Ld          Length of E; Ld = Li + Ls
T           Total bytes for I
Te          Total bytes for E
η           Efficiency (Te/T)
Let the input data I have length Li. After encoding, it becomes E with length Ld, where

Ld = Li + Ls

Total bytes for I: T = 8 * Li / 8 = Li
Total bytes for E: Te = 5 * (Li + Ls) / 8 = 5 * Ld / 8

Efficiency: η = Te / T = 5 * (Li + Ls) / (8 * Li)

For usable compression we need Te < T, that is 5 * (Li + Ls) / 8 < Li, which reduces to

Ls / Li < 3/5

This ratio is the proportion between the set length and the original length. It shows that the technique remains usable as long as the length of the set codes is at most 3/5 of the actual data length; in that case the saving is positive, while if the ratio exceeds 3/5 the saving becomes negative.
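The bound is easy to check numerically; a minimal sketch follows (the figures are illustrative, not measurements from Section V).

def eta(li, ls):
    """Encoded-to-original size ratio: eta = 5 * (Li + Ls) / (8 * Li)."""
    return 5 * (li + ls) / (8 * li)

print(eta(1000, 100))   # 0.6875: a 31.25% saving before further compression
print(eta(1000, 600))   # 1.0: at Ls = (3/5) * Li the scheme stops paying off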
V. EXPERIMENTAL RESULT
Since 5BC produces a large Unicode string in a printable format, its output can be compressed further with LZW and Huffman. We experiment with five algorithms, namely 5BC, LZW, Huffman, 5BC+LZW and 5BC+Huffman, and compare them in three kinds of test: compression of files, compression ratio and compression time.
Compression of files: the compressed file sizes achieved after applying the above compression techniques to the different datasets.
Compression ratio: the compression ratios achieved after applying the above compression techniques to the different datasets.
Compression time: the compression times taken after applying the above compression techniques to the different datasets.
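As an illustration of how such measurements can be taken, the following sketch times one compressor on one file. It uses Python's zlib merely as a stand-in second-stage compressor, corpus.txt is a hypothetical test file, and encode_5bc refers to the forward-mapping sketch of Section III.

import time
import zlib

def measure(name, compress, data):
    """Report compressed size, compression ratio and wall-clock time."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(out)} bytes, "
          f"ratio {len(out) / len(data):.3f}, {elapsed:.3f} s")

data = open("corpus.txt", "rb").read()      # hypothetical test file
measure("zlib alone", zlib.compress, data)
# A 5BC+X pipeline first runs the 5BC encoder and then feeds its Unicode
# output (serialized, e.g., as UTF-16 bytes) to the second-stage compressor:
# measure("5BC+zlib",
#         lambda d: zlib.compress(encode_5bc(d.decode()).encode("utf-16")),
#         data)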
A. Compression of files:
Fig. 1. Compression of file for different data set
From Fig. 1, the results show that LZW achieves the best compression across the different file sizes, but our proposed 5BC algorithm also gives very promising results and performs considerably better than the well-known Huffman algorithm. Moreover, as the data or file size grows, our technique converges toward the compression achieved by LZW. When these two techniques are combined with 5BC, both 5BC+Huffman and 5BC+LZW improve on their stand-alone performance.
B. Compression Ratio:
Fig. 2. Compression ratio for different data set
Fig. 2 demonstrates the compression ratio for the different datasets. The results show that LZW provides a very good compression ratio initially, while the file size is small, but as the amount of data grows its efficiency declines: LZW yields a poor compression ratio when the dictionary is small and, as the character length increases, the dictionary can no longer be built up easily, so its performance drops. Our proposed scheme, in contrast, provides a steady efficiency at any file size, and its performance improves as the number of datasets increases. It achieves a distinctly better efficiency than Huffman, and at some point it also converges toward the efficiency of LZW. Huffman's efficiency is poor because the technique derives its codes from the probability or frequency of occurrence of the possible values of the source symbols [13]; when the dataset is large, frequency generation becomes more difficult and a large number of individual symbols is created, so it delivers only an average compression ratio. Another important observation is that after combining both algorithms with 5BC, the performance of Huffman improves dramatically, because it then works on data already compressed by 5BC, which is much smaller than the original. The performance of LZW is improved as well.
C. Compression time:
Fig. 3 shows the compression times for the different datasets. Our proposed scheme on its own takes an average amount of time, quite similar to Huffman alone, but after combining it with LZW or Huffman the performance improves drastically.
Fig. 3. Compression time for different data set
VI. CONCLUSION
In this research we propose an efficient encoding and decoding technique for data compression that delivers fast and promising performance. The technique is a 5-bit compression scheme whose character set, i.e. dictionary, is arranged by the Zipf distribution. The proposed technique provides a promising saving of more than 60% and can easily be used to compress huge natural-text datasets. The forward and backward mappings together make the compression complete: on decompression we recover the original text exactly. The technique can be used to build innovative and efficient software for data communication, data storage and database engineering, and it lends itself to parallel processing environments and load balancing techniques to achieve promising encoding times.
REFERENCES
[1] Ashiq Mahmood, Tarique Latif, and K. M. Azharul Hasan, “An Efficient 6 bit Encoding Scheme for Printable Characters by table lookup (6BE),” International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 468-472, 2017.
[2] Md. Ashiq Mahmood, Tarique Latif, K. M. Azharul Hasan and Riadul Islam, “A Feasible 6 Bit Text Database Compression Scheme with Character Encoding (6BC),” 2018 21st International Conference of Computer and Information Technology (ICCIT), pp. 1-6, 2018.
[3] Nieves R. Brisaboa, Antonio Fariña and Gonzalo Navarro, “Lightweight natural language text compression,” Information Retrieval, 10(1), pp. 1-33, 2007.
[4] Wentian Li, “Random texts exhibit Zipf's-law-like word frequency distribution,” IEEE Transactions on Information Theory, vol. 38, no. 6, November 1992.
[5] Md. Abul Kalam Azad, Rezwana Sharmeen, Shabbir Ahmad and S. M. Kamruzzaman, “An Efficient Technique for Text Compression,” The 1st International Conference on Information Management and Business, pp. 467-473, 2005.
[6] M. M. Kodabagi, M. V. Jerabandi and Nagaraj Gadagin, “Multilevel security and compression of text data using bit stuffing and huffman coding,” 2015 International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pp. 800-804, 2015.
[7] Muthukumar Murugesan and T. Ravichandran, “Evaluate Database Compression Performance and Parallel Backup,” International Journal of Database Management Systems (IJDMS), vol. 5(4), pp. 17-25, 2013.
[8] Vu H. Nguyen, Hien T. Nguyen, Hieu N. Duong and Vaclav Snasel, “n-Gram-Based Text Compression,” Computational Intelligence and Neuroscience, vol. 2016.
[9] Senthil Shanmuga Sundaram and Robert Lourdusamy, “A Comparative Study Of Text Compression Algorithms,” International Journal of Wisdom Based Computing, vol. 1(3), pp. 68-76, 2011.
[10] A. Carus and A. Mesut, “Fast text compression using multiple static dictionaries,” Information Technology Journal, pp. 1013-1021, 2010.
[11] “Replacement And Bit Reduction,” SIPM, FCST, ITCA, WSE, ACSIT, CS & IT 06, pp. 407-416, 2012.
[12] N. Miladinovic, “Minimum-cost encoding of information. Coding with digits of unequal cost,” 4th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services (TELSIKS'99), pp. 600-603, vol. 2, 1999.
[13] M. Sangeetha, P. Betty and G.S. Nanda Kumar, “A biometric iris image compression using LZW and hybrid LZW coding algorithm,” 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), pp. 1-6, 2017.
[14] Wenjun Huang, Weimin Wang and Hui Xu, “A Lossless Data Compression Algorithm for Real-time Database,” 2006 6th World Congress on Intelligent Control and Automation, pp. 6645-6648, 2006.
[15] Ahmed Mokhtar, A. Mansour and Mona A. M. Fouad, “Dictionary based optimization for adaptive compression techniques,” 2013 36th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 421-425, 2013.