1st International Conference on Advances in Science, Engineering and Robotics Technology 2019 (ICASERT 2019)
Efficient Compression Scheme for Large
Natural Text Using Zipf Distribution
Md. Ashiq Mahmood and K.M. Azharul Hasan
Department of Computer Science and Engineering
Khulna University of Engineering & Technology (KUET)
Khulna, Bangladesh
Email: ashiqmahmoodbipu@gmail.com, azhasan@gmail.com
Abstract: Data compression is the process of modifying, encoding, or converting the bit structure of data so that it occupies less space. Character encoding is closely related to data compression; it represents each character by some encoding scheme. Encoding is the process of putting a sequence of characters into a specific format for efficient transmission or storage. Data compression covers a huge domain of applications, including data communication, data storage, and database development. In this paper we propose a new and efficient compression algorithm for large natural-text datasets, called 5-Bit Compression (5BC), in which every character is encoded by 5 bits. The algorithm encodes any English or Bangla character with 5 bits using a lookup table. The lookup table is constructed using the Zipf distribution, a discrete distribution of commonly used characters in different languages. 8-bit characters are converted to 5 bits by partitioning the characters into 7 sets and placing them in a single table; the location of a character in the table is then used to encode it uniquely by 5 bits. 5BC can compress text by more than 60% of the original size. The decompression algorithm that recovers the original data is also described. After the 5BC output string is produced, the LZW and Huffman techniques compress it further. Our experimental results demonstrate promising performance.
Keywords: encoding; compression; decompression; 5-bit compression; compression ratio.
I. INTRODUCTION
Data compression is a widely used and important issue in information science applications because of the storage capacity and transfer speed limits of different systems [1][2]. Its undeniable advantages include reduced storage requirements and transmission bandwidth, encoding with smaller numbers of bits, shorter broadcasting time, and effective channel utilization [3][5]. Having a good compression scheme that remaps any text to a shorter bit representation is therefore vital [1][2][5]. A new encoding system is valuable if it represents a data source as precisely as possible using the smallest number of bits while leaving the actual meaning unaltered [1][2][11]. In many cases it is also vital to restore the data to its original form from the compressed structure [6]. In this paper, the Zipf distribution is used to construct the dictionary of characters. It comes from Zipf's law, an empirical law formulated using mathematical statistics, which refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipf distribution. Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its position in the frequency table [4]. Thus the most frequent word occurs approximately twice as often as the second most frequent word and three times as often as the third most frequent word. Data compression techniques are classified broadly as lossy and lossless [8]. A lossy technique first separates the important from the unimportant data, applies lossless compression to the important part, and discards the rest; in effect, it reduces the size of a document by removing superfluous text or data [1][11]. The proposed 5BC algorithm is lossless [1][2][8]. The technique provides both forward and backward mapping. 5BC converts characters from 8 bits to 5 bits by splitting the characters into 7 sets and placing them in a single table. The characters in the table, i.e. the dictionary, are mapped using the Zipf distribution [4][12]. Characters of the same set are kept together in the corresponding set, since this produces shorter sequences of bit codes. 5BC can compress data by more than 60% of the original size. The 5BC compression technique is a logical scheme that shows promising efficiency.
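As an illustration of how a Zipf-style character ranking can be derived in practice (this sketch is not part of the proposed method, and the sample corpus string is only a stand-in), the following Python snippet counts character frequencies and orders them by rank, which is the ordering principle behind the 5BC dictionary.

# Sketch: derive a Zipf-style character ranking from a sample corpus.
# The tiny corpus below is a stand-in; in practice the ranking would be
# computed over a large body of English or Bangla text.
from collections import Counter

corpus = "I am Bangladeshi. Data compression reduces storage and bandwidth."
freq = Counter(corpus)                          # character -> occurrence count
ranked = [ch for ch, _ in freq.most_common()]   # Zipf rank 1, 2, 3, ...

# Zipf's law suggests frequency is roughly inversely proportional to rank,
# so the most common characters are placed first in the lookup table.
for rank, ch in enumerate(ranked[:10], start=1):
    print(rank, repr(ch), freq[ch])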
II. RELATED WORKS
The Huffman technique is a lossless technique based on entropy [9]. It constructs a binary tree, and the codes are read off from the paths of that tree. The Huffman algorithm is a variable-length encoding because its code lengths vary; in contrast, our 5BC algorithm always works with exactly 5 bits. The rule used in the Huffman technique is to encode the more frequently occurring data with a smaller number of bits. Huffman coding is used in JPEG files, and JPEG 2000 (JP2) is an image compression standard and coding system [13]. Huffman algorithms are classified as static Huffman and adaptive Huffman [9]. Static Huffman computes the frequencies first and then creates a single tree used for both forward and backward mapping [9][15], whereas adaptive Huffman builds the trees while computing the frequencies, constructing a tree in each of the two running processes [14]. There also exist compression techniques that are based on dictionaries rather than on statistical structure [13][15]. A well-known dictionary-based technique is the Lempel-Ziv-Welch (LZW) technique [9]. In this algorithm, a new string αK is constructed from a string α that is already in the dictionary followed by a character K; αK is then added to the dictionary [10]. In the end, each string of characters is replaced by its dictionary code [5]. The dictionary is built dynamically from the input data, so it is not static [10]. This raw dictionary can be recovered from the compressed data at the time of decoding; the LZW dictionary is built during compression and decompression and discarded once the compression or decompression process has ended [10]. Another technique, the LZ technique [5][13], is also popular because of its simplicity as well as its effective compression ratios. Our technique uses a lookup table that remains static and does not change when new data arrives. The lookup table, i.e. the dictionary, remains fixed and can be used for compression and decompression whenever required.
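For reference, a minimal textbook LZW compressor is sketched below; it is an illustration of the α / αK dictionary growth described above, not the implementation used in this paper.

# Minimal textbook LZW compression sketch: the dictionary grows dynamically
# by adding the string alpha + K whenever it is not already present.
def lzw_compress(text):
    dictionary = {chr(i): i for i in range(256)}   # start with single characters
    next_code = 256
    alpha = ""                                     # longest prefix already in the dictionary
    output = []
    for k in text:
        alpha_k = alpha + k
        if alpha_k in dictionary:
            alpha = alpha_k                        # keep extending the match
        else:
            output.append(dictionary[alpha])       # emit the code for alpha
            dictionary[alpha_k] = next_code        # add the new string alpha + K
            next_code += 1
            alpha = k
    if alpha:
        output.append(dictionary[alpha])
    return output

print(lzw_compress("banana bandana"))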
III. EFFICIENT COMPRESSION SCHEME FOR LARGE NATURAL TEXT
Generally, encoding a character requires 8 bits of memory [1][2]. We introduce a 5-bit character encoding scheme called 5BC that represents a character by 5 bits instead of 8. The algorithm works with all the common printable characters, covering both the English and Bangla characters found on a normal keyboard. A lookup table is built for representing the characters by 5 bits; it is illustrated in TABLE I.
With 5 bits, 2^5 = 32 combinations are possible. Among the 32 combinations, 25 are used for character encoding and the remaining 7 are used for set representation. The characters are split into 7 sets. In the lookup table,
- Bangla characters are placed in Set1, Set2 and Set3;
- English characters are placed in Set4, Set5, Set6 and Set7;
- 7 binary combinations are reserved for the 7 set codes.
Unicode uses 16 bits of memory to represent natural-language characters; the resulting 5-bit codes are regrouped and emitted as 16-bit Unicode characters.
TABLE I. LOOKUP TABLE FOR 5BC

Serial no.  Decimal value  Binary value  Set4  Set5   Set6
1           0              00000         A     Space  X
2           1              00001         B     ,      Y
3           2              00010         C     a      Z
4           3              00011         D     b      x
5           4              00100         E     c      y
6           5              00101         F     d      z
7           6              00110         G     e      '
8           7              00111         H     f      !
9           8              01000         I     g      "
10          9              01001         J     h      #
11          10             01010         K     i      \
12          11             01011         L     j      ~
13          12             01100         M     k      ^
14          13             01101         N     l      |
15          14             01110         O     m      $
16          15             01111         P     n      :
17          16             10000         Q     o      ;
18          17             10001         R     p      _
19          18             10010         S     q      `
20          19             10011         T     r
21          20             10100         U     s
22          21             10101         V     t
23          22             10110         W     u
24          23             10111         .     v
25          24             11000         ?     w
26          25             11001         Set-1
27          26             11010         Set-2
28          27             11011         Set-3
29          28             11100         Set-4
30          29             11101         Set-5
31          30             11110         Set-6
32          31             11111         Set-7
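To make the table concrete, the sketch below represents part of TABLE I as a mapping from character to (set number, 5-bit value), together with the seven reserved set-selector codes. It is an illustration only: the Bangla sets and the remaining symbol entries are omitted, and only the entries needed for the worked example are included.

# Partial 5BC lookup table: character -> (set number, 5-bit value).
SET_CODE = {n: 24 + n for n in range(1, 8)}   # Set1..Set7 are selected by codes 25..31

CHAR_TABLE = {
    # Set4: capital letters A..W (A = 0, B = 1, ..., I = 8, ...), then '.' and '?'
    **{ch: (4, i) for i, ch in enumerate("ABCDEFGHIJKLMNOPQRSTUVW")},
    ".": (4, 23), "?": (4, 24),
    # Set5: space (0), comma (1), then lowercase letters a..w (a = 2, b = 3, ...)
    " ": (5, 0), ",": (5, 1),
    **{ch: (5, i + 2) for i, ch in enumerate("abcdefghijklmnopqrstuvw")},
}

assert CHAR_TABLE["I"] == (4, 8) and CHAR_TABLE["a"] == (5, 2)   # matches Example 1
assert SET_CODE[4] == 28 and SET_CODE[5] == 29                   # set codes used in Example 1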
Algorithm 1: Forward Mapping
Input: Normal string S
Output: An encoded compressed string Sc
1: By adding the set-change codes,
2: represent the string S by S'.
3: By using the lookup table,
4: represent the string S' by its corresponding 5-bit representation.
5: Let the representative string S' contain K bits.
6: if K % 16 = 0 then
7:   skip the padding and jump to step 13
8: end if
9: else if K % 16 != 0 then
10:  append m padding bits formed from the last 5-bit set code, taking every bit of the set
11:  code one by one and continuing in the same way until (K + m) % 16 = 0
12: end else if
13: Represent the (K + m) bits as (K + m)/16 groups of 16 bits and
14: form the compressed string Sc by applying the corresponding Unicode characters.
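A minimal forward-mapping sketch is given below. It is an illustration under the simplified table of the previous sketch (repeated here so the block runs on its own), not the authors' implementation: it emits a set-selector code whenever the set changes, concatenates the 5-bit codes, pads the stream with set-code bits until its length is a multiple of 16 (steps 9-11), and outputs one Unicode character per 16-bit group.

# Forward mapping sketch: text -> 5-bit codes with set changes -> 16-bit Unicode string.
SET_CODE = {n: 24 + n for n in range(1, 8)}                     # Set1..Set7 -> 25..31
CHAR_TABLE = {
    **{ch: (4, i) for i, ch in enumerate("ABCDEFGHIJKLMNOPQRSTUVW")},
    ".": (4, 23), "?": (4, 24), " ": (5, 0), ",": (5, 1),
    **{ch: (5, i + 2) for i, ch in enumerate("abcdefghijklmnopqrstuvw")},
}

def forward_map(text):
    codes, current_set = [], None
    for ch in text:
        s, v = CHAR_TABLE[ch]
        if s != current_set:                 # steps 1-2: insert a set-change code
            codes.append(SET_CODE[s])
            current_set = s
        codes.append(v)
    bits = "".join(format(c, "05b") for c in codes)          # steps 3-5: 5-bit stream of K bits
    while len(bits) % 16 != 0:                               # steps 9-11: pad with set-code bits
        bits += format(SET_CODE[current_set], "05b")[: 16 - len(bits) % 16]
    # steps 13-14: every 16 bits become one Unicode character
    return "".join(chr(int(bits[i:i + 16], 2)) for i in range(0, len(bits), 16))

encoded = forward_map("I am Bangladeshi")
print([format(ord(c), "016b") for c in encoded[:2]])   # the two 16-bit blocks shown in Example 1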
Example 1:
Original Text (Input):
I am Bangladeshi
Set Representation: Set4 I Set5 space am space Set4 B
Set5 angladeshi
Decimal Representation: 28 8 29 0 2 14 0 28 1 29 2 15
8 13 2 5 6 20 9 10
5 bit representation: 11100 01000 11101 00000 00010
01110 00000 11100 00001 ….…
16 Bit representation: 1110001000111010
0000000100111000 …………
Compressed String: ǐ
Example 2:
Compressed String: ǐ
16 Bit representation:
--- 1110001000111010
--- 0000000100111000
After dividing the bit stream into 5-bit groups, the 5-bit representation is:
11100 --- 28 --- Set4
01000 --- 8 --- I
11101 --- 29 --- space
00000--- 0 ---- Set5 …………..
Decompressed String:
Set4 I Set5 space am space Set4 B Set5 angladeshi
Original Text: I am Bangladeshi
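A corresponding decoding sketch is shown below; it mirrors Example 2 and the backward mapping given next as Algorithm 2. It is again an illustration under the simplified table rather than the authors' code, and it simply ignores trailing set codes produced by padding instead of cropping them as in steps 5-11 of Algorithm 2.

# Backward mapping sketch: 16-bit Unicode string -> 5-bit codes -> original text.
SET_CODE_MIN = 25                                              # codes 25..31 select Set1..Set7
CHAR_TABLE = {
    **{ch: (4, i) for i, ch in enumerate("ABCDEFGHIJKLMNOPQRSTUVW")},
    ".": (4, 23), "?": (4, 24), " ": (5, 0), ",": (5, 1),
    **{ch: (5, i + 2) for i, ch in enumerate("abcdefghijklmnopqrstuvw")},
}
REVERSE = {sv: ch for ch, sv in CHAR_TABLE.items()}            # (set, value) -> character

def backward_map(encoded):
    bits = "".join(format(ord(c), "016b") for c in encoded)    # step 1: back to a bit stream
    out, current_set = [], None
    for i in range(0, len(bits) - len(bits) % 5, 5):           # step 12: take 5 bits at a time
        v = int(bits[i:i + 5], 2)
        if v >= SET_CODE_MIN:
            current_set = v - 24                               # a set-selector code was read
        elif (current_set, v) in REVERSE:
            out.append(REVERSE[(current_set, v)])              # step 13: map back to a character
    return "".join(out)

# The two 16-bit blocks shown in Examples 1 and 2 decode to the start of the text:
sample = chr(0b1110001000111010) + chr(0b0000000100111000)
print(backward_map(sample))                                    # -> "I am"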
Algorithm 2: Backward Mapping
Input: Compressed string Sc
Output: The original string S
1: Represent the string Sc by its corresponding 16-bit binary values from the Unicode table.
2: Let Sc contain K bits.
3: To determine the 5-bit representation,
4: first find K'.
5: Store the first set code of K in C.
6: Take every 5-bit set code of the K' bits and compare it with C.
7: if C is not repeated then
8:   there is an occurrence of a set change, so C is updated with the new 5-bit set code and
9:   this new C is compared with the next set code; when C is repeated, the following bit stream is cropped,
10:  including the bit code of the repeated C.
11: end if
12: From the remaining K bits, take 5 bits at a time and map them to characters according to the current set in the lookup table.
13: After excluding the set codes, the original string is generated.

IV. ANALYTICAL ANALYSIS

In 5BC, each 8-bit character is represented by 5 bits, so 3 bits are saved per character and the basic efficiency of 5BC is (8 - 5)/8 = 37.5%. After the 5-bit codes are packed into 16-bit Unicode characters, the best efficiency is more than 60%. We develop a theoretical analysis to calculate the efficiency precisely. The parameters considered in the analysis are listed in TABLE II.

TABLE II. PARAMETERS FOR ANALYTICAL EVALUATION

Parameter  Description
I          Original data
E          Encoded data
Ls         Length of the set codes
Li         Length of I
Ld         Length of E; Ld = Li + Ls
T          Total bytes for I
Te         Total bytes for E
η          Efficiency
Let the input data I have length Li characters. After encoding, it becomes E with length Ld, where

  Ld = Li + Ls.

Each original character occupies 8 bits, so the total number of bytes for I is

  T = 8 Li / 8 = Li.

Each encoded symbol occupies 5 bits, and the 5-bit stream is packed into 16-bit Unicode characters, so (ignoring the padding of the final 16-bit block) the total number of bytes for E is

  Te = 5 Ld / 8 = 5 (Li + Ls) / 8.

The efficiency is

  η = 1 - Te / T
    = 1 - 5 (Li + Ls) / (8 Li)
    = (3 Li - 5 Ls) / (8 Li)
    = 3/8 - (5/8)(Ls / Li).

For usable compression, Te <= T, which gives

  5 (Li + Ls) / 8 <= Li
  5 Ls <= 3 Li
  Ls / Li <= 3/5.
The deciding quantity is therefore the ratio between the set-code length and the original length. The technique remains usable as long as the set-code length Ls is at most 3/5 of the original data length Li; in that case the efficiency is positive. If the ratio exceeds 3/5, the efficiency becomes negative.
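As a quick check of the formula, consider the string of Example 1, which has Li = 16 characters and needs Ls = 4 set codes (the byte counts below ignore the padding of the final 16-bit block; the sketch is illustrative only):

# Efficiency check for Example 1: "I am Bangladeshi"
Li = 16               # original characters, 1 byte each
Ls = 4                # set codes inserted: Set4, Set5, Set4, Set5
Ld = Li + Ls          # encoded symbols, 5 bits each

T = Li                          # total bytes of the original data
Te = 5 * Ld / 8                 # total bytes of the encoded data (padding ignored)
eta = 1 - Te / T                # efficiency

print(Te, eta)                  # 12.5 bytes, efficiency 0.21875
print(Ls / Li <= 3 / 5)         # True: within the 3/5 threshold, so compression is useful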
V. EXPERIMENTAL RESULT
Since 5BC produces a large Unicode string in a printable format, its output can be compressed further by applying LZW and Huffman. We experimented with five algorithms: 5BC, LZW, Huffman, 5BC+LZW and 5BC+Huffman. The experimental results compare these five algorithms in three kinds of tests: compression of files, compression ratio and compression time.
Compression of files: the comparison of the compressed sizes achieved after applying the above-mentioned compression techniques to the different datasets.
Compression ratio: the comparison of the compression ratios achieved after applying the above-mentioned compression techniques to the different datasets.
Compression time: the comparison of the compression times achieved after applying the above-mentioned compression techniques to the different datasets.
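The three measurements can be reproduced with a small harness like the sketch below; it is an illustration only, the dataset string is a stand-in, and zlib (a DEFLATE implementation combining LZ77 and Huffman coding) is used as a readily available placeholder for whichever of the five pipelines is plugged in as compress_fn.

# Minimal harness for measuring compressed size, compression ratio and time.
import time
import zlib

def measure(compress_fn, data):
    start = time.perf_counter()
    compressed = compress_fn(data)
    elapsed = time.perf_counter() - start
    ratio = 1 - len(compressed) / len(data)     # fraction of the original size saved
    return len(compressed), ratio, elapsed

data = ("I am Bangladeshi. " * 5000).encode("utf-8")
size, ratio, seconds = measure(zlib.compress, data)
print(f"compressed size: {size} bytes, ratio: {ratio:.3f}, time: {seconds:.4f} s")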
A. Compression of files:
Fig. 1. Compression of files for different data sets
From Fig. 1, the results show that LZW provides the best compressed sizes for files of different sizes, but our proposed 5BC algorithm also provides very promising results and performs far better than the well-known Huffman algorithm. Moreover, as the data or file size increases, the compressed sizes produced by our technique converge towards those of LZW. And if these two techniques are combined with 5BC, both 5BC+Huffman and 5BC+LZW improve their performance.
(Fig. 1 plots the compressed file size in MB against the file number for the original file size and the five techniques: 5BC, LZW, Huffman, 5BC+LZW and 5BC+Huffman.)
B. Compression Ratio:
Fig. 2 demonstrates the compression ratio for the different datasets. The result shows that LZW provides a very good compression ratio at the initial level, when the file size is small, but as the size of the data increases its efficiency becomes poor, because LZW gives a bad compression ratio when the dictionary is small and, as the character length increases, the dictionary cannot be built up as easily, so its performance is reduced. Our proposed technique, in contrast, provides a steady efficiency rate at any file size, and its performance improves as the number of datasets increases. It provides a considerably better efficiency rate than Huffman, and at some point it also converges towards the efficiency rate of LZW. The reason Huffman provides poor efficiency is that the Huffman technique derives its codes from the probability or frequency of occurrence of the possible values of the source symbols [13]; if the dataset is quite large, the frequency generation becomes more difficult and a large number of individual symbols is created, so it provides only an average compression ratio. Another important observation is that after combining both algorithms with 5BC, the performance of Huffman improves dramatically, because this time it works on the data already compressed by 5BC, which is much smaller than the original data. The performance of LZW is also improved.
Fig. 2. Compression ratio for different data sets
C. Compression time:
Figure 3 provides the compression times for different
datasets. The result shows that our proposed idea shows
average result time when after combining it with LZW and
Huffman the performance improves drastically. Though
Huffman individually provides the time quite similar with our
proposed idea.
Fig. 3. Compression time for different data sets
VI. CONCLUSION
In this research we propose an efficient encoding and decoding technique for data compression that provides fast and promising performance. The technique is a 5-bit compression scheme whose character set, i.e. dictionary, is produced using the Zipf distribution. The proposed technique provides a promising efficiency rate of more than 60%. It can easily be used to compress huge amounts of natural text. Both the forward and the backward mapping make the compression scheme complete, i.e. on decompression we recover the original text. The technique can be used to build innovative and efficient software for data communication, data storage, and database engineering. It can also be applied in parallel processing environments and combined with load balancing techniques to achieve promising encoding times.
REFERENCES
[1] Ashiq Mahmood, Tarique Latif, and K. M. Azharul Hasan, “An Efficient 6 bit Encoding Scheme for Printable Characters by table lookup (6BE)”, International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 468-472, 2017.
[2] Md. Ashiq Mahmood, Tarique Latif, K. M. Azharul Hasan and Riadul Islam, “A Feasible 6 Bit Text Database Compression Scheme with Character Encoding (6BC)”, 2018 21st International Conference of Computer and Information Technology (ICCIT), pp. 1-6, 2018.
[3] Nieves R. Brisaboa, Antonio Fariña, Gonzalo Navarro, “Lightweight natural language text compression”, Information Retrieval, 10(1), pp. 1-33, 2007.
[4] Wentian Li, “Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution”, IEEE Transactions on Information Theory, Vol. 38, No. 6, November 1992.
[5] Md. Abul Kalam Azad, Rezwana Sharmeen, Shabbir Ahmad and S. M. Kamruzzaman, “An Efficient Technique for Text Compression”, The 1st International Conference on Information Management and Business, pp. 467-473, 2005.
[6] M. M. Kodabagi, M. V. Jerabandi, Nagaraj Gadagin, “Multilevel security and compression of text data using bit stuffing and huffman coding”, 2015 International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pp. 800-804, 2015.
[7] Muthukumar Murugesan, T. Ravichandran, “Evaluate Database Compression Performance and Parallel Backup”, International Journal of Database Management Systems (IJDMS), Vol. 5(4), pp. 17-25, 2013.
[8] Vu H. Nguyen, Hien T. Nguyen, Hieu N. Duong and Vaclav Snasel, “n-Gram-Based Text Compression”, Computational Intelligence and Neuroscience, Volume 2016.
[9] Senthil Shanmuga Sundaram, and Robert Lourdusamy, “A Comparative Study Of Text Compression Algorithms”, International Journal of Wisdom Based Computing, Vol. 1(3), pp. 68-76, 2011.
[10] A. Carus, and A. Mesut, “Fast text compression using multiple static dictionaries”, Information Technology Journal, pp. 1013-1021, 2010.
[11] “Replacement And Bit Reduction”, SIPM, FCST, ITCA, WSE, ACSIT, CS & IT 06, pp. 407-416, 2012.
[12] N. Miladinovic, “Minimum-cost encoding of information. Coding with digits of unequal cost”, 4th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services, TELSIKS'99 (Cat. No.99EX365), pp. 600-603, Vol. 2, 1999.
[13] M. Sangeetha, P. Betty, G. S. Nanda Kumar, “A biometric iris image compression using LZW and hybrid LZW coding algorithm”, 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), pp. 1-6, 2017.
[14] Wenjun Huang, Weimin Wang, Hui Xu, “A Lossless Data Compression Algorithm for Real-time Database”, 2006 6th World Congress on Intelligent Control and Automation, pp. 6645-6648, 2006.
[15] Ahmed Mokhtar, A. Mansour, Mona A. M. Fouad, “Dictionary based optimization for adaptive compression techniques”, 2013 36th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 421-425, 2013.